Exploring Efficient Data Movement Strategies for Exascale Systems with Deep Memory Hierarchies and Heterogeneous Memory (or DMEM: Data Movement for hEterogeneous Memory)


Exploring Efficient Data Movement Strategies for Exascale Systems with Deep Memory Hierarchies and Heterogeneous Memory (or DMEM: Data Movement for hEterogeneous Memory)
Pavan Balaji (PI), Computer Scientist
Antonio Pena, Postdoctoral Researcher
Argonne National Laboratory
Project Dates: Sep. 2012 to Aug. 2017

System Architecture Complexity
- Processor heterogeneity is a well-known issue:
  - Heavyweight general-purpose cores
  - Lightweight accelerator cores: no branch prediction, in-order instruction execution
- Memory heterogeneity: the step-child of the heterogeneous computing era:
  - Main memory
  - Scratchpad memory
  - Nonvolatile memory
  - Memory reliability and performance variation (because of power constraints)

Managing Heterogeneous Memory
- Core problem being addressed: heterogeneous memory is inevitable; all upcoming supercomputers use it in some form. Applications need to make the leap from using legacy main memory to richer memory domains such as NVRAM, scratchpad memory, accelerator memory, etc.
- Abstract architectural model: our view of the system architecture focuses on utilizing the different memory systems as directly accessible regions.
- Goals:
  - Each memory has semantic differences that need to be addressed; we want to provide fundamental models for interacting with such memory.
  - Efficient end-to-end data motion from any memory to any memory (possibly across coherence domains).
  - Moderated load/store accesses to memory (where applicable).
(Figure: a hierarchical memory view (core, cache, main memory, NVRAM, disk) contrasted with heterogeneous memory as first-class citizens (core, main memory, scratchpad memory, NVRAM, less reliable memory, accelerator memory, compute-capable memory).)

Applications and Heterogeneous Memory: Case Studies
Several applications are already looking at utilizing different types of memory regions.
- Computational chemistry: iterative convergence models allow most iterations to tolerate (infrequent) errors; the same concept is used in 32-bit/64-bit mixed-precision computations.
- Nuclear physics: Green's Function Monte Carlo simulations rely on large per-process memory footprints for their computations. Current computations treat memory as units of uniform read/write performance; with NVRAM, scientists are considering modifying their algorithms to make them more read-intensive.
Reference: A. E. DePrince, III and J. R. Hammond, "Coupled Cluster Theory on Graphics Processing Units I. The Coupled Cluster Doubles Method," J. Chem. Theory Comput. 7, 1287 (2011).

Programming Environments in the Heterogeneous Memory Era
- Memory fragmentation is inevitable; it is already seen with accelerator memory and scratchpad regions.
- Applications are already embracing heterogeneous memory while taking advantage of the characteristics of each memory domain.
- Programming environments are, unfortunately, falling behind:
  - We tend to treat main memory as a "special" memory region where the primary computation is performed.
  - Data movement and coordination are staged in main memory because of this view that main memory is superior in some way.
  - Computation relies on the characteristics of main memory for algorithmic choices: similar read/write performance, memory consistency semantics, reliability semantics.

Challenges and Opportunities
(Overview diagram of the project components: applications (chemistry, nuclear physics, biology); programming constructs (data residence annotations, memory consistency semantics, data motion description); runtime management (performance/power management, weak memory consistency, integrated data movement, memory reliability management); end-to-end data movement; introspection tools; heterogeneous memory semantics (consistency, reliability, power/energy efficiency); and target platforms (hardware simulators, DOE leadership machines, accelerators, CODEX).)

End-to-end Data Movement

Everyone is a First-Class Citizen
(Figure: two nodes, each with processes attached to main memory, scratchpad, NVRAM, and less reliable memory regions, communicating with each other.)
We envision an environment where all memory regions are first-class citizens, with a runtime system that provides efficient data placement and data movement capabilities.

Example Heterogeneous Architecture: Accelerator Clusters
- Graphics Processing Units (GPUs):
  - Many-core architecture for high performance and efficiency (FLOPs, FLOPs/Watt, FLOPs/$)
  - Programming models: CUDA, OpenCL, OpenACC
  - Explicitly managed global memory and separate address spaces
- CPU clusters:
  - MPI-based DRAM-to-DRAM communication
  - Host memory only
- Disjoint memory spaces!
(Figure: a node with a multicore CPU hosting MPI ranks 0-3 in shared/main memory, and a GPU multiprocessor with its own global memory, connected over PCIe to the CPU and the NIC.)

Programming Heterogeneous Memory Systems (e.g., MPI+CUDA)
(Figure: two ranks, each with GPU device memory and CPU main memory connected over PCIe, communicating across the network.)
- Programmability/productivity: manual data movement leads to complex, non-portable codes.
- Performance:
  - Manual copies between host and GPU memory serialize the PCIe and interconnect transfers.
  - It is difficult for the user to do optimal pipelining or to utilize the DMA engine efficiently.
  - Architecture-specific optimizations are required.
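To make the programmability and performance issues concrete, here is a minimal sketch of the manual staging pattern described above, using standard MPI and CUDA runtime calls; the function name, buffer names, element counts, and tags are illustrative, not taken from any application in this deck.

    /* Manual MPI+CUDA staging (illustrative sketch): the GPU buffer must be
     * copied through host memory before it can be sent over the network. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    void exchange(double *d_buf, double *h_buf, int n, int rank) {
        if (rank == 0) {
            /* Stage device data into host memory, then send. */
            cudaMemcpy(h_buf, d_buf, n * sizeof(double), cudaMemcpyDeviceToHost);
            MPI_Send(h_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Receive into host memory, then copy up to the device. */
            MPI_Recv(h_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            cudaMemcpy(d_buf, h_buf, n * sizeof(double), cudaMemcpyHostToDevice);
        }
    }

Each transfer serializes a PCIe copy and a network operation, and every such pattern has to be re-tuned by hand for pipelining or DMA use.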

DMEM: A Model for Unified Data Movement
(Figure: two ranks, each with a CPU, main memory, GPU memory, NVRAM, and unreliable memory, communicating over the network.)
    if (rank == 0) { MPI_Send(any_buf, ...); }
    if (rank == 1) { MPI_Recv(any_buf, ...); }
Reference: "MPI-ACC: An Integrated and Extensible Approach to Data Movement in Accelerator-Based Systems", Ashwin Aji, James S. Dinan, Darius T. Buntinas, Pavan Balaji, Wu-chun Feng, Keith R. Bisset and Rajeev S. Thakur. IEEE International Conference on High Performance Computing and Communications (HPCC), 2012.
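For contrast with the manual sketch above, a heterogeneous-memory-aware MPI in the spirit of MPI-ACC lets the application hand the device-resident (or NVRAM/scratchpad-resident) buffer directly to the communication call, leaving staging, pipelining, and architecture-specific fast paths to the runtime. A minimal sketch, assuming such an implementation is available:

    /* Sketch: with a heterogeneous-memory-aware MPI (MPI-ACC style), the
     * device buffer is passed directly; the runtime handles staging,
     * pipelining, and any architecture-specific fast paths internally. */
    #include <mpi.h>

    void exchange_unified(double *d_buf, int n, int rank) {
        if (rank == 0)
            MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }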

DMEM Runtime Optimizations
- Topology-aware pipelining of data through a host-side buffer pool
- Caching of metadata (e.g., handles)
- Multi-stream data transfer when possible (e.g., on newer accelerators)
- Architecture-specific optimizations: GPUDirect
(Figure: timeline of GPU-to-network transfers without and with pipelining; the pipelined transfer is 29% better than manual blocking and 14.6% better than manual non-blocking transfers.)
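The sketch below illustrates the pipelining idea on the send side only: device data is staged through a small pool of pinned host buffers so that the PCIe copy of one chunk overlaps the network send of the previous one. The chunk size, pool depth, and reliance on asynchronous MPI progress are simplifying assumptions, not the DMEM implementation.

    /* Minimal sketch of chunked, pipelined device-to-network transfer. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    #define CHUNK (1 << 20)   /* 1 MiB pipeline granularity (illustrative) */
    #define NBUF  2           /* double-buffered host staging pool */

    void pipelined_send(const char *d_buf, size_t bytes, int dst) {
        char *h_pool[NBUF];
        cudaStream_t s[NBUF];
        MPI_Request req[NBUF];
        for (int i = 0; i < NBUF; i++) {
            cudaMallocHost((void **)&h_pool[i], CHUNK);  /* pinned memory for async copies */
            cudaStreamCreate(&s[i]);
            req[i] = MPI_REQUEST_NULL;
        }
        for (size_t off = 0, i = 0; off < bytes; off += CHUNK, i = (i + 1) % NBUF) {
            size_t len = (bytes - off < CHUNK) ? bytes - off : CHUNK;
            /* Reuse this slot only after its previous network send completed. */
            MPI_Wait(&req[i], MPI_STATUS_IGNORE);
            cudaMemcpyAsync(h_pool[i], d_buf + off, len, cudaMemcpyDeviceToHost, s[i]);
            cudaStreamSynchronize(s[i]);  /* chunk staged; its send overlaps the next PCIe copy */
            /* Overlap relies on the MPI library making progress on outstanding sends. */
            MPI_Isend(h_pool[i], (int)len, MPI_BYTE, dst, 0, MPI_COMM_WORLD, &req[i]);
        }
        MPI_Waitall(NBUF, req, MPI_STATUSES_IGNORE);
        for (int i = 0; i < NBUF; i++) { cudaStreamDestroy(s[i]); cudaFreeHost(h_pool[i]); }
    }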

Traditional Intranode Communication
- Communication without heterogeneous memory support requires 2 PCIe data copies + 2 main memory copies, and the transfers are serialized.
- Shared-memory host integration allows direct transfer into the shared memory buffer: sender and receiver drive the transfer concurrently, the data transfer is pipelined, and the PCIe links are fully utilized.
- Direct copy: DMA-driven peer GPU copy, i.e., peer-to-peer data transfer between heterogeneous memory regions.
(Figure: two processes on a node exchanging GPU data either via host shared memory or via a direct GPU-to-GPU copy.)
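For the direct-copy path, the sketch below shows one standard way a DMA-driven peer GPU copy can be set up between two processes that share a node, using CUDA IPC handles exchanged over MPI. Error handling and the completion notification back to the receiver are omitted for brevity; the function name and buffer arguments are illustrative.

    /* Sketch of a DMA-driven peer GPU copy between two processes on one node. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    void peer_copy(void *my_dbuf, size_t bytes, int rank) {
        if (rank == 1) {
            /* Receiver exports a handle to its device buffer. */
            cudaIpcMemHandle_t h;
            cudaIpcGetMemHandle(&h, my_dbuf);
            MPI_Send(&h, sizeof(h), MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            /* A real implementation would also wait for a "copy done" message here. */
        } else if (rank == 0) {
            /* Sender maps the peer buffer and issues a device-to-device DMA copy. */
            cudaIpcMemHandle_t h;
            void *peer_dbuf;
            MPI_Recv(&h, sizeof(h), MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            cudaIpcOpenMemHandle(&peer_dbuf, h, cudaIpcMemLazyEnablePeerAccess);
            cudaMemcpy(peer_dbuf, my_dbuf, bytes, cudaMemcpyDeviceToDevice);
            cudaIpcCloseMemHandle(peer_dbuf);
        }
    }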

Shared Memory Performance
- Less impact in the D2D case, where PCIe latency is dominant. Improvement: 6.7% (D2D), 15.7% (H2D), 10.9% (D2H).
- Bandwidth discrepancy in the different PCIe bus directions. Improvement: 56.5% (D2D), 48.7% (H2D), 27.9% (D2H); nearly saturates the peak (6 GB/sec) in the D2H case.

Direct DMA Performance
- Bandwidth nearly reaches the peak bandwidth of the system.
Reference: "DMA-Assisted, Intranode Communication in GPU Accelerated Systems", Feng Ji, Ashwin Aji, James S. Dinan, Darius T. Buntinas, Pavan Balaji, Rajeev S. Thakur, Wu-chun Feng and Xiaosong Ma. IEEE International Conference on High Performance Computing and Communications (HPCC), 2012.

Example: 2D Stencil Computation
(Figure: halo exchange between GPUs on neighboring ranks. The halo data is non-contiguous; each exchange requires a high-latency cudaMemcpy to or from the CPU on each side plus MPI_Isend/Irecv between the CPUs, for a total of 16 MPI transfers + 16 GPU-CPU transfers, i.e., 2x the number of transfers.)
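For reference, a non-contiguous halo column like the ones in this stencil is typically described to MPI with a derived datatype. Below is a minimal sketch, assuming a row-major local grid of width nx with one ghost layer; the function name, indices, and neighbor handling are illustrative. With a GPU-buffer-aware MPI along the lines discussed in this project, the same call can operate directly on the device-resident grid.

    /* Sketch: describing one halo column with an MPI derived datatype. */
    #include <mpi.h>

    void exchange_left_column(double *grid, int nx, int ny, int left, MPI_Comm comm) {
        MPI_Datatype column;
        /* ny blocks of 1 double, with a stride of one full row between them. */
        MPI_Type_vector(ny, 1, nx, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);

        /* Send our first interior column, receive into the left ghost column. */
        MPI_Sendrecv(&grid[nx + 1], 1, column, left, 0,
                     &grid[nx + 0], 1, column, left, 0,
                     comm, MPI_STATUS_IGNORE);
        MPI_Type_free(&column);
    }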

“Compute-capable Memory” Optimizations
- Element-wise traversal of the datatype by different GPU threads.
- This is an embarrassingly parallel problem, except for structs, where element sizes are not uniform.
(Figure: threads packing blocks B0-B3 of a noncontiguous buffer; the layout is recorded by dataloops, which store the number of elements and allow traversal by element index with reads/writes using the type's extent/size.)
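As a concrete illustration of element-wise packing by GPU threads, the sketch below packs a strided (vector-like) layout on the device with one element per thread. The kernel name, launch configuration, and double element type are illustrative; this is not the dataloop-driven engine described in the referenced work.

    /* Sketch of GPU-side packing for a strided (vector) datatype. */
    #include <cuda_runtime.h>

    __global__ void pack_vector(const double *src, double *packed,
                                int count, int blocklen, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global element index */
        int n = count * blocklen;
        if (i < n) {
            int blk = i / blocklen;   /* which block of the vector */
            int off = i % blocklen;   /* offset within that block */
            packed[i] = src[blk * stride + off];
        }
    }

    void pack_on_gpu(const double *d_src, double *d_packed,
                     int count, int blocklen, int stride, cudaStream_t s) {
        int n = count * blocklen;
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        pack_vector<<<blocks, threads, 0, s>>>(d_src, d_packed, count, blocklen, stride);
    }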

Evaluating Memory-attached Computational Capabilities
Reference: "Enabling Fast, Noncontiguous GPU Data Movement in Hybrid MPI+GPU Environments", John Jenkins, James S. Dinan, Pavan Balaji, Nagiza F. Samatova and Rajeev S. Thakur. IEEE International Conference on Cluster Computing (Cluster), 2012.

Epidemiology Simulation (EpiSimdemics)
- EpiSimdemics models the spatio-temporal diffusion (spread) of a contagious disease through the social contact networks of populations.
- It represents social networks as labeled bipartite graphs with two disjoint vertex sets, People and Locations.
- The duration of interaction between people is modeled using the activities and the overlap of stay of different people at different locations.
- A variant of finite state machines, called probabilistic timed transition systems (PTTSs), is used to represent within-host disease propagation.
PI: Madhav Marathe, Virginia Tech

Case Study: Epidemiology
(Figure: traditional model vs. DMEM on each PE. Traditional model: (1) copy data from the host CPU to the GPU, (2) process on the GPU. DMEM: (1a) pipelined data transfers to the GPU, (1b) processing overlapped with internode CPU-GPU communication.)

Evaluating the Epidemiology Simulation
- The GPU has two orders of magnitude faster memory.
- DMEM enables new application-level optimizations.

FDM-Seismological Modeling
- Modeling seismic waves with analytical methods is highly complex due to the irregular heterogeneity of the Earth's interior, friction laws, realistic attenuation, etc. Hence, approximate numerical methods such as the Finite Difference Method (FDM) are used to solve the differential wave equations.
- This application implements the staggered-grid velocity-stress FDM method for modeling seismic waves.
- It models the seismic waves by interpolating or triangulating the wave parameters measured at various seismic sensors.

Case Study: Seismology

Case Study: Seismology
- Up to 43% performance improvement.
- Trade-offs:
  - Data marshaling on CPU vs. GPU? GPU is better, and the cudaMemcpy is avoided.
  - Data communication from CPU vs. GPU? CPU is better because a PCIe hop is avoided.
Reference: "On the Efficacy of GPU-Integrated MPI for Scientific Applications", Ashwin M. Aji, Lokendra S. Panwar, Feng Ji, Milind Chabbi, Karthik Murthy, Pavan Balaji, Keith R. Bisset, James Dinan, Wu-chun Feng, John Mellor-Crummey, Xiaosong Ma, and Rajeev Thakur. ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC), 2013.

Programming Constructs for Matching Application and Memory Semantics

Data Placement and Semantics in Heterogeneous Memory Architectures
- The memory usage characteristics of applications give the runtime system opportunities to place (and manage) data in different memory regions:
  - Read-intensive workloads that can get away with slightly lower memory bandwidth can use nonvolatile memory.
  - Workloads that have inherent errors in them might be able to get away with less-than-perfect memory reliability.

Measurement Results (courtesy Jeff Vetter, Oak Ridge National Laboratory)
Reference: D. Li, J.S. Vetter, G. Marin, C. McCurdy, C. Cira, Z. Liu, and W. Yu, "Identifying Opportunities for Byte-Addressable Non-Volatile Memory in Extreme-Scale Scientific Applications," IEEE International Parallel & Distributed Processing Symposium (IPDPS). Shanghai: IEEE, 2012.

Programming Model/Constructs Support for Memory Management
- Data movement constructs and annotations.
- A PGAS-like model to trap load/store accesses to predefined memory locations:
  - Read-intensive workloads with nonconflicting writes can be placed on NVRAM with store buffering.
  - Reordering and main-memory caching can be internally employed by the runtime system.
Example: static memory allocation (annotated source and the trapped form it is translated into):
    __nvram__ int X[100];
    int foo(void) {
        int x = X[15];
        return 0;
    }
    /* translated form: the load is routed through the runtime */
    int foo(void) {
        int x = __nvram_bar(X + 15);
        return 0;
    }
Example: dynamic memory migration:
    int X[100], Y[100];
    int foo(void) {
        #pragma dmem read noconflict for
        for (int i = 0; i < 100; i++) {
            Y[i] = X[i];
        }
        return 0;
    }
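As an illustration of the "trap load/store accesses" idea (not the DMEM implementation itself), the sketch below shows one conventional Linux technique for intercepting accesses to a predefined region: map it without access permissions and re-enable access in the SIGSEGV handler, which gives a runtime a hook to buffer, log, or migrate the touched page. The 4 KiB page size and the bare-bones handler are simplifying assumptions.

    /* Sketch: trapping accesses to a region via PROT_NONE + SIGSEGV handler. */
    #include <signal.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>

    static void  *region;
    static size_t region_len = 1 << 20;

    static void on_fault(int sig, siginfo_t *si, void *ctx) {
        (void)sig; (void)ctx;
        uintptr_t page = (uintptr_t)si->si_addr & ~((uintptr_t)4095);
        /* Runtime hook: record the touched page, decide placement, etc. */
        mprotect((void *)page, 4096, PROT_READ | PROT_WRITE);  /* let the access proceed */
    }

    int main(void) {
        struct sigaction sa;
        sa.sa_sigaction = on_fault;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, NULL);

        /* "NVRAM-like" region whose accesses are trapped by the runtime. */
        region = mmap(NULL, region_len, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        ((int *)region)[15] = 42;            /* faults once; handler re-enables the page */
        printf("%d\n", ((int *)region)[15]);
        return 0;
    }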

Relaxed Memory Consistency
- Inter-process/inter-thread memory consistency can be expensive:
  - Full memory barriers can take several hundred cycles today for DRAM.
  - With NVRAM or slower memory models, this can be much more expensive.
- Compilers/hardware provide eventuality semantics (data written by another process will "eventually" be visible to me); what "eventually" means can differ across architectures.
- Are strict consistency semantics always critical? In what cases can we relax these semantics?
Example (needs memory barriers):
    Thread 0:
        X = 1;
        flag = 1;
    Thread 1:
        while (!flag);
        Y = X;
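To make the barrier requirement concrete, here is the same flag/data idiom written with C11 atomics: the release store on the flag and the matching acquire load supply exactly the ordering the example needs. The use of C11 <threads.h> is only to keep the sketch self-contained.

    /* Sketch: the flag/data idiom with explicit release/acquire ordering. */
    #include <stdatomic.h>
    #include <threads.h>
    #include <stdio.h>

    static int X;
    static atomic_int flag;

    static int producer(void *arg) {
        (void)arg;
        X = 1;                                                   /* payload write */
        atomic_store_explicit(&flag, 1, memory_order_release);   /* publish: barrier before the flag write */
        return 0;
    }

    static int consumer(void *arg) {
        (void)arg;
        while (!atomic_load_explicit(&flag, memory_order_acquire))
            ;                                                    /* wait: barrier after the flag read */
        int Y = X;                                               /* guaranteed to observe X == 1 */
        printf("Y = %d\n", Y);
        return 0;
    }

    int main(void) {
        thrd_t t0, t1;
        thrd_create(&t0, producer, NULL);
        thrd_create(&t1, consumer, NULL);
        thrd_join(t0, NULL);
        thrd_join(t1, NULL);
        return 0;
    }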

Summary
- Memory heterogeneity is becoming increasingly common, and different memories have different characteristics.
- Applications have already started investigating approaches to utilize these different memory regions.
- Programming environments, however, have traditionally treated main memory as a special entity for data placement and data movement.
- This can no longer be the case: each memory architecture comes with its own set of capabilities and constraints, and allowing applications to utilize each of them as a first-class citizen is critical.

Relevant Publications
- Ashwin M. Aji, Lokendra S. Panwar, Wu-chun Feng, Pavan Balaji, James S. Dinan, Rajeev S. Thakur, Feng Ji, Xiaosong Ma, Milind Chabbi, Karthik Murthy, John Mellor-Crummey and Keith R. Bisset. "MPI-ACC: GPU-Integrated MPI for Scientific Applications." (under preparation for IEEE Transactions on Parallel and Distributed Systems (TPDS))
- John Jenkins, Pavan Balaji, James S. Dinan, Nagiza F. Samatova, and Rajeev S. Thakur. "MPI Derived Datatypes Processing on Noncontiguous GPU-resident Data." (under preparation for IEEE Transactions on Parallel and Distributed Systems (TPDS))
- Ashwin M. Aji, Lokendra S. Panwar, Wu-chun Feng, Pavan Balaji, James S. Dinan, Rajeev S. Thakur, Feng Ji, Xiaosong Ma, Milind Chabbi, Karthik Murthy, John Mellor-Crummey and Keith R. Bisset. "On the Efficacy of GPU-Integrated MPI for Scientific Applications." ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC). June 17-21, 2013, New York, New York.
- Ashwin M. Aji, Pavan Balaji, James S. Dinan, Wu-chun Feng and Rajeev S. Thakur. "Synchronization and Ordering Semantics in Hybrid MPI+GPU Programming." Workshop on Accelerators and Hybrid Exascale Systems (AsHES), held in conjunction with the IEEE International Parallel and Distributed Processing Symposium (IPDPS). May 20, 2013, Boston, Massachusetts.
- John Jenkins, James S. Dinan, Pavan Balaji, Nagiza F. Samatova and Rajeev S. Thakur. "Enabling Fast, Noncontiguous GPU Data Movement in Hybrid MPI+GPU Environments." IEEE International Conference on Cluster Computing (Cluster). Sep. 28-30, 2012, Beijing, China.
- Feng Ji, Ashwin M. Aji, James S. Dinan, Darius T. Buntinas, Pavan Balaji, Rajeev S. Thakur, Wu-chun Feng and Xiaosong Ma. "DMA-Assisted, Intranode Communication in GPU Accelerated Systems." IEEE International Conference on High Performance Computing and Communications (HPCC). June 25-27, 2012, Liverpool, UK.
- Ashwin M. Aji, James S. Dinan, Darius T. Buntinas, Pavan Balaji, Wu-chun Feng, Keith R. Bisset and Rajeev S. Thakur. "MPI-ACC: An Integrated and Extensible Approach to Data Movement in Accelerator-Based Systems." IEEE International Conference on High Performance Computing and Communications (HPCC). June 25-27, 2012, Liverpool, UK.
- Feng Ji, James S. Dinan, Darius T. Buntinas, Pavan Balaji, Xiaosong Ma and Wu-chun Feng. "Optimizing GPU-to-GPU intra-node communication in MPI." Workshop on Accelerators and Hybrid Exascale Systems (AsHES), held in conjunction with the IEEE International Parallel and Distributed Processing Symposium (IPDPS). May 25, 2012, Shanghai, China.