Workshop on HPC in India
Programming Models, Languages, and Compilation for Accelerator-Based Architectures
R. Govindarajan, SERC, IISc

Presentation transcript:

Workshop on HPC in India: Programming Models, Languages, and Compilation for Accelerator-Based Architectures
R. Govindarajan, SERC, IISc
ATIP 1st Workshop on HPC in India, SC-09

Current Trend in HPC Systems
 Top500 systems have hundreds of thousands (100,000s) of cores
 Large HPC systems
 Performance scaling is a major challenge
 The number of cores per processor/node is increasing
 4–6 cores per processor, many cores per node
 Parallelism even at the node level
 Top systems use accelerators
 GPUs and Cell BEs
 1000s of processing elements in a single GPU!

HPC Design Using Accelerators
 High levels of performance from accelerators
 Variety of general-purpose hardware accelerators
 GPUs: NVIDIA, ATI, …
 Accelerators: ClearSpeed, Cell BE, …
 Plethora of instruction sets, even for SIMD
 Programmable accelerators, e.g., FPGA-based
 HPC design using accelerators
 Exploit instruction-level parallelism
 Exploit data-level parallelism on SIMD units
 Exploit thread-level parallelism on multiple units/multi-cores
 Challenges
 Portability across different generations and platforms
 Ability to exploit different types of parallelism

Accelerators – Cell BE

Accelerators – GPU

The Challenge
(Figure: a babel of device-specific interfaces – SSE, CUDA, OpenCL, ARM Neon, AltiVec, AMD CAL)

Programming in Accelerator-Based Architectures
 Develop a framework that
 is programmed in a higher-level language, and is efficient
 can exploit different types of parallelism on different hardware
 exploits parallelism across heterogeneous functional units
 is portable across platforms – not device-specific!

Existing Approaches
 C/C++ → auto-vectorizer → SSE/AltiVec on CPU
 CUDA/OpenCL → nvcc/JIT compiler → PTX/ATI CAL IL on CPU and GPUs
 Brook → Brook compiler → ATI CAL IL on CPU and GPUs

Existing Approaches (contd.)
 StreamIt → StreamIt compiler → Cell BE, RAW
 Accelerator → DirectX runtime → CPU, GPUs
 OpenMP → standard compiler → CPU, GPUs

What is needed?
 Synergistic execution on multiple heterogeneous cores
 Source languages: streaming languages, MPI, OpenMP, CUDA/OpenCL, array languages (Matlab), parallel languages
 A common compiler/runtime system
 Targets: Cell BE, other accelerators, multicores, GPUs, SSE

What is needed? (contd.)
 Synergistic execution on multiple heterogeneous cores
 Source languages: streaming languages, MPI, OpenMP, CUDA/OpenCL, array languages (Matlab), parallel languages
 PLASMA: a high-level IR shared by the compiler and runtime system
 Targets: Cell BE, other accelerators, multicores, GPUs, SSE

Stream Programming Model
 A higher-level programming model in which nodes represent computation and channels represent communication (producer/consumer relations) between them
 Exposes pipelined parallelism and task-level parallelism
 Temporal streaming of data
 Synchronous Data Flow (SDF), Stream Flow Graph, StreamIt, Brook, …
 Compilation techniques for achieving rate-optimal, buffer-optimal, software-pipelined schedules
 Mapping applications to accelerators such as GPUs and Cell BE
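Because SDF fixes the pop/push rates of every node per firing, a steady-state schedule can be computed statically from the balance equations. A minimal C sketch of the two-actor case (the helper names are illustrative, not from the talk):

```c
#include <assert.h>

/* For an SDF edge where the producer pushes p items per firing and the
   consumer pops c items per firing, the steady state fires the producer
   c/g times and the consumer p/g times (g = gcd(p, c)), so production
   exactly balances consumption. */
static int gcd(int a, int b) { return b ? gcd(b, a % b) : a; }

static void sdf_repetitions(int p, int c, int *prod_reps, int *cons_reps) {
    int g = gcd(p, c);
    *prod_reps = c / g;
    *cons_reps = p / g;
}
```

For example, a source pushing 1 item per firing feeding a filter that pops 2 must fire twice per consumer firing in the steady state.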

The StreamIt Language
 StreamIt programs are a hierarchical composition of three basic constructs:
 Pipeline
 SplitJoin, with a round-robin or duplicate splitter
 FeedbackLoop
 Stateful filters
 Peek values
(Figure: a pipeline of filters; a splitter/joiner pair around parallel streams; a feedback loop with splitter, body, and joiner)

Why StreamIt on GPUs
 More "natural" than frameworks like CUDA or CTM
 Easier learning curve than CUDA
 No need to think of "threads" or blocks
 StreamIt programs are easier to verify
 The schedule can be determined statically

Issues in Mapping StreamIt to GPUs
 Work distribution across multiprocessors
 GPUs have hundreds of processing pipes!
 Exploit task-level and data-level parallelism
 Schedule across the multiprocessors
 Multiple concurrent threads per SM to exploit DLP
 Execution configuration: task granularity and concurrency
 Lack of synchronization between the processors of the GPU
 Managing CPU-GPU memory bandwidth

Stream Graph Execution
(Figure: a stream graph with filters A, B, C, D software-pipelined across SM1–SM4; instances A1–A4, B1–B4, C1–C4, D1–D4 illustrate pipeline parallelism, task parallelism, and data parallelism)

Our Approach for GPUs
 Code for SAXPY:

    float->float filter saxpy {
      float a = 2.5f;
      work pop 2 push 1 {
        float x = pop();
        float y = pop();
        float s = a * x + y;
        push(s);
      }
    }
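As a plain-C rendering of this filter's semantics (assuming the input stream interleaves x and y values; the function name is illustrative), each firing pops two items and pushes one:

```c
#include <stddef.h>

/* One steady-state run of the saxpy filter over a flat input stream:
   each firing pops x and y and pushes a * x + y. Returns the number
   of items pushed. */
static size_t saxpy_stream(const float *in, size_t n_in, float *out) {
    const float a = 2.5f;
    size_t n_out = 0;
    for (size_t i = 0; i + 1 < n_in; i += 2) {
        float x = in[i];      /* pop() */
        float y = in[i + 1];  /* pop() */
        out[n_out++] = a * x + y;  /* push(s) */
    }
    return n_out;
}
```

Because the pop and push rates are compile-time constants, the compiler can size buffers and schedule firings without any runtime coordination.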

Our Approach (contd.)
 Multithreading
 Identify a good execution configuration to exploit the right amount of data parallelism
 Memory
 Efficient buffer layout scheme to ensure all accesses to GPU memory are coalesced
 Task partitioning between GPU and CPU cores
 Work scheduling and processor (SM) assignment problem
 Takes into account communication bandwidth restrictions

Execution Configuration
(Figure: 128 data-parallel instances A0–A127 and B0–B127 grouped into macro nodes with execution times 32 and 16; total execution time 64 on 2 SMs gives MII = 64/2 = 32)
 More threads for exploiting data-level parallelism
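The MII arithmetic in the figure (64 units of work on 2 SMs giving MII = 32) generalizes to a simple resource bound on the modulo schedule; a sketch, with illustrative names:

```c
/* Resource-constrained lower bound on the initiation interval when
   modulo-scheduling macro nodes onto identical SMs: total work
   divided by the number of SMs, rounded up. */
static int resource_mii(const int *work, int n, int num_sms) {
    int total = 0;
    for (int i = 0; i < n; i++)
        total += work[i];
    return (total + num_sms - 1) / num_sms;  /* ceiling division */
}
```

With the slide's macro-node times of 32 and 16 each appearing so that the total is 64, two SMs yield an MII of 32.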

Coalesced Memory Access
 GPUs have a banked memory architecture with a very wide memory channel
 Accesses by threads in an SM have to be coalesced
(Figure: eight data elements d0–d7 across banks B0–B3; with a per-thread chunked layout, threads 0–3 access strided addresses, while an interleaved layout lets them access consecutive addresses)
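The layout trick in the figure can be expressed as an index transform: instead of giving each thread a contiguous chunk (a strided access pattern), the buffer is interleaved so that on step i, threads 0..T-1 touch T consecutive addresses. A sketch, with illustrative names:

```c
#include <stddef.h>

/* Per-thread chunked layout: thread t's i-th item sits in its own
   contiguous region, so simultaneous accesses by a warp are strided. */
static size_t strided_index(size_t t, size_t i, size_t items_per_thread) {
    return t * items_per_thread + i;
}

/* Interleaved layout: on step i, threads 0..T-1 hit T consecutive
   addresses, which the GPU can merge into one wide memory transaction. */
static size_t coalesced_index(size_t t, size_t i, size_t num_threads) {
    return i * num_threads + t;
}
```

This is why the buffer layout scheme matters: the same logical stream, stored interleaved, turns every warp access into a single coalesced transaction.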

Execution on CPU and GPU
 Problem: partition work across CPU and GPU
 Data transfer between GPU and host memory is required, based on the partition!
 Coalesced access is efficient for the GPU, but harmful for the CPU!
 Transform data before moving it from/to GPU memory
 Reduce the overall execution time, taking memory transfer and transform delays into account!

Scheduling and Mapping
(Figure: an initial StreamIt graph with filters A–E annotated with per-device costs, e.g., CPU:10/GPU:20, is partitioned into a graph with CPU load 45, GPU load 40, and DMA load 40, giving MII = 45)
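In the partitioned graph, the steady-state rate is set by the busiest resource: with CPU load 45, GPU load 40, and DMA load 40, the figure reports MII = 45. A simplified model of that bound (a sketch, not the talk's actual partitioner):

```c
/* Simplified steady-state model: the CPU, the GPU, and the DMA channel
   work concurrently in a software pipeline, so the initiation interval
   is bounded below by the most loaded of the three resources. */
static int bottleneck_mii(int cpu_load, int gpu_load, int dma_load) {
    int m = cpu_load;
    if (gpu_load > m) m = gpu_load;
    if (dma_load > m) m = dma_load;
    return m;
}
```

A good partition therefore balances the three loads; shifting work off the CPU here would lower the MII until another resource becomes the bottleneck.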

Scheduling and Mapping (contd.)
(Figure: the software-pipelined steady state; the CPU, DMA channel, and GPU concurrently execute filter instances from different iterations – e.g., A_n alongside instances such as B_{n-1}, C_{n-3}, D_{n-5}, and E_{n-7})

Compiler Framework
 StreamIt program
 Generate code for profiling → execute profile runs
 Configuration selection
 Task partitioning (ILP partitioner or heuristic partitioner)
 Instance partitioning
 Modulo scheduling
 Code generation → CUDA code + C code

Experimental Results on Tesla
 Significant speedup for synergistic execution
(Figure: benchmark speedups on Tesla, exceeding 52x, 32x, and 65x on different benchmarks)

What is needed? (recap)
 Synergistic execution on multiple heterogeneous cores: many source languages (streaming languages, MPI, OpenMP, CUDA/OpenCL, array languages such as Matlab, parallel languages), compiled through PLASMA, a high-level IR, by a common compiler and runtime system, down to Cell BE, other accelerators, multicores, GPUs, and SSE

IR: What should a solution provide?
 Rich abstractions for functionality
 Independence from any single architecture
 Portability without compromising efficiency
 Scale up and scale down
 From a single-core embedded processor to a multi-core workstation
 Take advantage of accelerators (GPU, Cell, …)
 Transparent distributed memory
 PLASMA: Portable Programming for PLASTIC SIMD Accelerators

PLASMA IR
 Matrix-vector multiply (Y = M · V), expressed as a data-parallel multiply over a row slice followed by an add-reduction:

    par mul, temp, A[i*n : i*n+n : 1], X
    reduce add, Y[i : i+1 : 1], temp
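In scalar C, the two IR operations for row i correspond to an elementwise multiply into a temporary followed by an add-reduction; a sketch of that semantics (array names follow the slide, the function name is illustrative):

```c
/* Row i of Y = A * X, as the slide's two PLASMA ops:
   par mul   : temp[j] = A[i*n + j] * X[j]   (data-parallel)
   reduce add: Y[i]    = sum_j temp[j]       (reduction)      */
static void plasma_matvec_row(const float *A, const float *X,
                              float *Y, int i, int n, float *temp) {
    for (int j = 0; j < n; j++)        /* par mul */
        temp[j] = A[i * n + j] * X[j];
    float s = 0.0f;
    for (int j = 0; j < n; j++)        /* reduce add */
        s += temp[j];
    Y[i] = s;
}
```

Keeping the parallel map and the reduction as separate IR operations is what lets the backend lower each one to whatever the target offers – SSE lanes, CUDA threads, or a plain scalar loop.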

Our Framework
 "CPLASM", a prototype high-level assembly language
 Prototype PLASMA IR compiler
 Currently supported targets: C (scalar), SSE3, CUDA (NVIDIA GPUs)
 Future targets: Cell, ATI, ARM Neon, …
 Compiler optimizations for this "vector" IR

Our Framework (contd.)
 Plenty of optimization opportunities!

PLASMA IR Performance
 Normalized execution time is comparable to that of a hand-tuned library!

Ongoing Work
 Look at other high-level languages!
 Target other accelerators
(Recap of the roadmap figure: many languages → PLASMA high-level IR → compiler and runtime system → many targets)

Compiling OpenMP/MPI/X10
 Mapping the semantics
 Exploiting data parallelism and task parallelism
 Communication and synchronization across CPU/GPU/multiple nodes
 Accelerator-specific optimizations
 Memory layout, memory transfers, …
 Performance and scaling

Acknowledgements
 My students!
 IISc and SERC
 Microsoft and NVIDIA
 ATIP, NSF, and all sponsors
 ONR

Thank You!!