Published by Gary Park. Modified over 9 years ago.
1
Programming Models for Accelerator-Based Architectures
R. Govindarajan
HPC Lab, SERC, IISc
govind@serc.iisc.ernet.in
Jan. 2009 © RG@SERC, IISc
2
HPC Design Using Accelerators
High level of performance from accelerators
Variety of general-purpose hardware accelerators
  – GPUs: NVIDIA, ATI
  – Accelerators: ClearSpeed, Cell BE, …
  – Plethora of instruction sets, even for SIMD
Programmable accelerators, e.g., FPGA-based
HPC design using accelerators
  – Exploit instruction-level parallelism
  – Exploit data-level parallelism on SIMD units
  – Exploit thread-level parallelism on multiple units/multi-cores
Challenges
  – Portability across different generations and platforms
  – Ability to exploit different types of parallelism
3
Accelerators – Cell BE
4
Accelerators – 8800 GPU
5
The Challenge
6
Programming in Accelerator-Based Architectures
Develop a framework that
  – Is programmed in a higher-level language, and is efficient
  – Can exploit different types of parallelism on different hardware
  – Exploits parallelism across heterogeneous functional units
  – Is portable across platforms, not device specific!
Jointly with Prof. Matthew Jacob, Architecture Lab., SERC, IISc
7
Existing Approaches
StreamIt: compiler targeting RAW and Cell BE
Accelerator: runtime system targeting GPUs
Brook: compiler targeting GPUs
C/C++: auto-vectorizer targeting SSE/AltiVec
8
What is needed
Compiler/Runtime System
9
Two-Pronged Approach
[Diagram labels: CUDA; Profile-based Compiler; GPUs; Multicore; PLASMA: High-Level Intermediate Representation; Compiler and Runtime System]
10
Two-Pronged Approach
[Diagram labels: StreamIt; CUDA; Profile-based Compiler; GPUs; Multicore; PLASMA: High-Level Intermediate Representation; Compiler and Runtime System]
11
Stream Programming Model
Higher-level programming model where nodes represent computation and channels represent communication (producer/consumer relation) between them
Exposes pipelined parallelism and task-level parallelism
Temporal streaming of data
Synchronous Data Flow (SDF), Stream Flow Graph, StreamIt, Brook, …
Compiling techniques for achieving rate-optimal, buffer-optimal, software-pipelined schedules
Mapping applications to accelerators such as GPUs and Cell BE
12
The StreamIt Language
StreamIt programs are a hierarchical composition of three basic constructs:
  – Pipeline
  – SplitJoin (round-robin or duplicate splitter)
  – Feedback Loop
Stateful filters
Peek values...
[Figure labels: Filter; Stream; Splitter; Joiner; Body; Loop]
13
StreamIt
Number of push/pop values fixed and known at compile time
Multi-rate firing
[Figure: 2-Band Equalizer, with labels Signal Source; Dup. Splitter; Bandpass Filter + Amplifier (two branches); Combiner]
14
Multi-Rate Firing
Consistent firing rate of nodes to ensure no data accumulation on channels
If node A fires 3 times, B should fire twice, and C should fire 4 times
Solving a set of linear equations!
  $N_A \times 2 = N_B \times 3$
  $N_B \times 4 = N_C \times 2$
Multiple solutions possible
Primitive steady-state solution (firing rates)
[Figure: stream graph over nodes A, B, C with channel rates 2, 3, 4, 2]
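The balance equations on this slide can be solved mechanically. The sketch below is only an illustration, not the StreamIt compiler's algorithm: it propagates rational firing rates along channels and scales them to the smallest integer solution. The function name `steady_state` and the tuple layout `(producer, push, consumer, pop)` are assumptions made for this example.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def steady_state(channels, start):
    """Return the primitive steady-state firing counts of a stream graph.

    channels: list of (producer, push_rate, consumer, pop_rate) tuples
              (hypothetical layout chosen for this sketch).
    start:    any node of the graph; its rate is fixed to 1 before scaling.
    """
    rate = {start: Fraction(1)}
    pending = list(channels)
    while pending:
        progressed = False
        for ch in list(pending):
            src, push, dst, pop = ch
            if src in rate and dst in rate:
                # Both ends known: the balance equation must already hold.
                if rate[src] * push != rate[dst] * pop:
                    raise ValueError("inconsistent rates: data would accumulate")
            elif src in rate:
                rate[dst] = rate[src] * push / pop
            elif dst in rate:
                rate[src] = rate[dst] * pop / push
            else:
                continue  # neither end reached yet; retry on a later pass
            pending.remove(ch)
            progressed = True
        if not progressed:
            raise ValueError("stream graph is not connected")
    # Scale the rational rates to the smallest all-integer solution.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (r.denominator for r in rate.values()), 1)
    counts = {node: int(r * scale) for node, r in rate.items()}
    common = reduce(gcd, counts.values())
    return {node: c // common for node, c in counts.items()}


# The slide's graph: A pushes 2 and B pops 3; B pushes 4 and C pops 2.
firing = steady_state([("A", 2, "B", 3), ("B", 4, "C", 2)], start="A")
print(firing)
```

For the slide's graph this yields 3 firings of A, 2 of B, and 4 of C, matching the stated solution; any integer multiple of it is also a valid (non-primitive) solution.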
15
StreamIt on GPUs
StreamIt provides a convenient way of programming GPUs
More "natural" than frameworks like CUDA or CTM for most domains
Gentler learning curve than CUDA: the programmer does not need to think of the program in terms of "threads" or "blocks", but only as a set of communicating filters
StreamIt programs are easier to verify, since the I/O rates of each filter are static, and hence the schedule can be determined entirely at compile time
16
Challenges on GPUs
Work distribution between the multiprocessors
  – GPUs have hundreds of processors (SMs and SIMD units)!
Exploiting task-level and data-level parallelism
  – Scheduling across the multiprocessors
  – Multiple concurrent threads in an SM to exploit DLP
Determining the execution configuration (number of threads for each filter) that minimizes execution time
Register constraints (even though there are ~1000s of them)
Lack of synchronization mechanisms between the multiprocessors of the GPU
Managing CPU-GPU memory bandwidth efficiently
"Stateless" filters exploit data parallelism, but "stateful" filters require special attention
17
Existing Approaches
Single Threaded SIMD Execution
18
Existing Approaches (contd.)
Execution on Cell BE
Our Approach for GPUs
19
Compiling Stream Programs to CUDA for GPUs
Software pipeline the execution of the stream program on the GPU
  – This takes care of synchronization and consistency issues, since the multiprocessors can execute their work in a decoupled fashion, with kernel invocations being the only synchronization points.
  – Work distribution and scheduling are accomplished by formulating the problem as a unified Integer Linear Program and solving it using standard ILP solvers.
  – The ILP formulation is sufficiently simple to be solved in a few seconds on current hardware.
20
Example
High-level code:
  for (i = 0; i < n; i++)
      A[i] = A[i] + s;
Target assembly code:
  Loop: LD    F0, 0(R1)
        ADDD  F4, F2, F0
        ST    0(R1), F4
        Add   R1, R1, #8
        Sub   R2, R2, #1
        Beqz  R2, Loop
[Figure: DDG over Ld, Addd, St, Add, Sub, Beq, with edge latencies 2 and 3]
21
Basic Block Scheduling
A target architecture with 1 Int, 1 FP, 1 Ld/St, and 1 Branch functional unit (FU)
Load latency = 2 cycles
FP latency = 3 cycles
All other instructions take 1 cycle

T    Int.   Ld/St   FP     Br.
1           Ld
2
3    Sub            Addd
4
5
6    Add    St             Beq
7           Ld
8
9    Sub            Addd
10
11
12   Add    St             Beq

6 cycles for each iteration.
[Figure: DDG over Ld, Addd, St, Add, Sub, Beq, with edge latencies 2 and 3]
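The 6-cycle figure follows from the longest dependence chain through one iteration. The sketch below computes that chain length under stated assumptions: only the true dependences Ld → Addd, Addd → St, and Sub → Beq are modelled, and functional-unit limits are ignored. The function name `schedule_length` and the dictionary encoding are made up for this illustration.

```python
def schedule_length(latency, deps):
    """Length in cycles of an earliest-start schedule for one iteration.

    latency: operation name -> latency in cycles.
    deps:    operation name -> list of operations it must wait for.
    Resource limits are ignored, so this is a lower bound on any real schedule.
    """
    start = {}

    def earliest(op):
        # An operation starts once all of its predecessors have finished.
        if op not in start:
            start[op] = max(
                (earliest(p) + latency[p] for p in deps.get(op, [])),
                default=0,
            )
        return start[op]

    return max(earliest(op) + latency[op] for op in latency)


# Latencies from the slide: load 2, FP 3, everything else 1.
latency = {"Ld": 2, "Addd": 3, "St": 1, "Add": 1, "Sub": 1, "Beq": 1}
# Assumed true dependences (anti-dependences on R1 are left out of this sketch).
deps = {"Addd": ["Ld"], "St": ["Addd"], "Beq": ["Sub"]}

length = schedule_length(latency, deps)
print(length)  # Ld (2) + Addd (3) + St (1)
```

The chain Ld, Addd, St accounts for all 6 cycles, which is why reordering within a single iteration cannot shorten the schedule.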
22
Overlapped Execution of Iterations

T    Int.   Ld/St   FP     Br.
1    Sub    Ld
2    Add
3                   Addd
4
5
6           St             Beq

(The slide overlays two further iterations in rows up to 12, each repeating the Sub/Ld, Add, Addd, St/Beq pattern.)

Schedule the Add (and Sub) early
  – May cause a problem with St due to anti-dependence (WAR)
Offset of the store can be adjusted (-8 or -16 can be used!)
  – Enables the next Ld to be scheduled sooner!
Repetitive pattern appears!
Throughput = 2 cycles per iteration!
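The 2-cycle throughput matches the resource-constrained lower bound on the initiation interval used in software pipelining. The sketch below computes that bound for the slide's machine, assuming every operation occupies its functional unit for exactly one issue cycle. The names `resource_mii`, `op_units`, and `unit_counts` are hypothetical.

```python
from collections import Counter
from math import ceil


def resource_mii(op_units, unit_counts):
    """Resource-constrained minimum initiation interval.

    op_units:    operation name -> functional-unit class it issues on.
    unit_counts: functional-unit class -> number of such units.
    Assumes each operation holds its unit for a single cycle.
    """
    usage = Counter(op_units.values())
    return max(ceil(usage[unit] / unit_counts[unit]) for unit in usage)


# One loop iteration from the example, mapped onto the slide's four units.
op_units = {
    "Ld": "Ld/St", "St": "Ld/St",
    "Addd": "FP",
    "Add": "Int", "Sub": "Int",
    "Beq": "Br",
}
unit_counts = {"Int": 1, "Ld/St": 1, "FP": 1, "Br": 1}

mii = resource_mii(op_units, unit_counts)
print(mii)
```

Both the Int unit (Add, Sub) and the Ld/St unit (Ld, St) are used twice per iteration, so no schedule on this machine can start iterations more often than every 2 cycles.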
23
Overlapped Execution of Iterations
[Figure: the overlapped schedule (columns T, Int., Ld/St, FP, Br.) divided into Prolog, Kernel (repeated n-2 times), and Epilog]
24
Stream Graph Execution
SIMD Execution
Buffer requirement = 4x
[Figure: stream graph with filters A, B, C, D; timeline (steps 0 to 7) of instances A1–A4, B1–B4, C1–C4, D1–D4 on SM1–SM4]
25
Stream Graph Execution
Software Pipelined Execution
Buffer requirement = 2x
[Figure: stream graph with filters A, B, C, D; timeline (steps 0 to 7) of instances A1–A4, B1–B4, C1–C4, D1–D4 on SM1–SM4]
26
Our Approach
Good execution configuration determined by profiling
  – Identify near-optimal number of concurrent thread instances per filter
  – Takes into consideration register constraints
Formulate work scheduling and processor (SM) assignment as a unified Integer Linear Program
  – Takes into account communication bandwidth restrictions
Efficient buffer layout scheme to ensure all accesses to GPU memory are coalesced
Stateful filters are assigned to CPUs; synergistic execution of CPUs and GPUs is ongoing work!
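The slide does not spell out the buffer layout. One common way to obtain coalesced accesses, shown here purely as an assumption and not as the scheme used in this work, is to interleave the items of concurrent filter instances so that neighbouring threads touch neighbouring addresses. All function names below are hypothetical.

```python
def naive_index(thread, item, pop_rate):
    """Each thread owns a contiguous chunk: thread-major layout."""
    return thread * pop_rate + item


def interleaved_index(thread, item, num_threads):
    """The i-th items of all threads are stored side by side: item-major layout."""
    return item * num_threads + thread


def is_coalesced(addresses):
    """True if consecutive threads access consecutive addresses."""
    return all(b - a == 1 for a, b in zip(addresses, addresses[1:]))


num_threads, pop_rate = 4, 3

# Addresses touched when every thread reads its first input item.
naive = [naive_index(t, 0, pop_rate) for t in range(num_threads)]
interleaved = [interleaved_index(t, 0, num_threads) for t in range(num_threads)]

print(naive, is_coalesced(naive))
print(interleaved, is_coalesced(interleaved))
```

With the thread-major layout the threads stride through memory by the pop rate, while the item-major layout makes each simultaneous access fall on a contiguous run, which is the access pattern GPU memory systems can merge into fewer transactions.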
27
ILP Formulation
Resource constraints:
$w_{k,v,p} = 1$: the $k$th instance of filter $v$ is mapped to SM $p$
28
ILP Formulation
Dependence constraint:
(j,k,v): scheduled time of the $k$th instance of filter $v$ in steady-state iteration $j$
$o_{k,v}$ specifies the time within the SWP kernel
$f_{k,v}$ specifies the stage of the SWP kernel
Filter execution must complete by kernel end
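The slide's equations are not reproduced in this transcript. As a rough illustration only, the sketch below uses the usual modulo-scheduling decomposition, in which an instance's start time is built from its stage and its offset within a kernel of length T. The formula, the symbol `T`, the per-instance `duration`, and both function names are assumptions of this sketch, not the authors' formulation.

```python
def start_time(j, f, o, T):
    """Assumed start time of a filter instance in steady-state iteration j.

    f: stage of the software-pipelined kernel (f_{k,v} on the slide).
    o: offset within the kernel (o_{k,v} on the slide).
    T: kernel length (initiation interval); hypothetical symbol.
    """
    return T * (j + f) + o


def fits_in_kernel(o, duration, T):
    """Check that an instance starting at offset o completes by kernel end."""
    return 0 <= o and o + duration <= T


T = 10
print(start_time(j=0, f=1, o=3, T=T))  # stage 1, offset 3, first iteration
print(start_time(j=2, f=1, o=3, T=T))  # same instance two iterations later
print(fits_in_kernel(o=3, duration=4, T=T))
print(fits_in_kernel(o=8, duration=4, T=T))
```

Under this reading, successive iterations of the same instance are exactly T apart, and an instance whose offset plus duration exceeds T would violate the requirement that filter execution complete by kernel end.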
29
ILP Formulation
Dependence constraint (contd.):
Admissibility of the schedule is given by:
[equation not captured in this transcript]
Solving the above constraints gives the schedule!
30
Compiler Framework
31
Experimental Results
Speedup of stream programs on GPU (8800) compared to CPU
Filters are coarsened before scheduling!
32
Experimental Results (contd.)
Improvements due to buffer coalescing
More results in the CGO-09 paper!
33
Two-Pronged Approach
[Diagram labels: Compiler/Runtime System; CUDA; Profile-based Compiler; GPUs; Multicore]
34
Challenges
Different SIMD architectures (threaded (GPU) vs. short vector (CPU))
Multiple homogeneous cores
Heterogeneous accelerators
Distributed memory on chip!
35
What should a solution provide?
Rich abstractions for functionality
  – Not a lowest common denominator
Independence from any single architecture
Portability without compromises on efficiency
  – Don't forget high-performance goals of the ISA
Scale-up and scale-down
  – Single-core embedded processor to multi-core workstation
Take advantage of accelerators (GPU, Cell, etc.)
Transparent distributed memory
PLASMA: Portable Programming for PLASTIC SIMD Accelerators
36
Our Approach
Stream Program Intermediate Representation
CUDA, C with intrinsics, stream, or other high-level programming model to a high-level intermediate language
  – Perform suitable compiler optimizations
  – Intermediate representation expressive enough to handle (target) machine specificities
IR to target machine
  – Exploit SIMD and thread-level parallelism
  – Agnostic to SIMD width
  – Manages heterogeneous memory
37
PLASMA Overview
38
PLASMA IR
Operator
  – Add, Mult, …
Vector
  – 1-D bulk data type of base types
  – E.g.
Distributor
  – Distributes operator over vector
  – Example: par add returns
Vector composition
  – Concat, slice, gather, scatter, …
[Figure: Matrix-Vector Multiply dataflow, with labels Slice, M, V, Par Mul, Reduce Add]
Matrix-Vector Multiply:
  par mul, temp, A[i * n : i * n + n : 1], X
  reduce add, Y[i : i + 1 : 1], temp
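To make the two IR statements concrete, the sketch below emulates them with plain Python lists. It is an interpretation of the slide, not the PLASMA compiler's semantics: `par` is taken to apply an operator element-wise, `reduce` to fold a vector to one value, and `A` to be a row-major matrix stored as a flat vector. The helper names `par` and `matvec` are hypothetical.

```python
from functools import reduce
from operator import add, mul


def par(op, xs, ys):
    """Distribute a binary operator element-wise over two vectors."""
    return [op(x, y) for x, y in zip(xs, ys)]


def matvec(A, X, rows, n):
    """Matrix-vector multiply written in the shape of the slide's IR.

    A is a flat, row-major list of rows * n elements; X has n elements.
    """
    Y = [0] * rows
    for i in range(rows):
        # par mul, temp, A[i * n : i * n + n : 1], X
        temp = par(mul, A[i * n : i * n + n : 1], X)
        # reduce add, Y[i : i + 1 : 1], temp  (the slice names the single element Y[i])
        Y[i] = reduce(add, temp)
    return Y


A = [1, 2, 3,
     4, 5, 6]          # 2 x 3 matrix, row-major
X = [1, 0, 2]
Y = matvec(A, X, rows=2, n=3)
print(Y)
```

Nothing in this form mentions a SIMD width: the slice length n is the only vector size, which is consistent with the stated goal of an IR that stays agnostic to the target's vector width.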
39
Our Framework
"CPLASM", a prototype high-level assembly language
Prototype PLASMA IR compiler
Currently supported targets: C (scalar), SSE3, CUDA (NVIDIA GPUs)
Future targets: Cell, ATI, ARM Neon, ...
Compiler optimizations for this "vector" IR
40
Our Framework (contd.)
41
Experimental Results
Kernel programs written in CPLASM
Compiled to C or CUDA, exposing SIMD parallelism
Execution on SSE2 or GPU
Comparison with hand-optimized library
42
Initial Results
Compares well with hand-optimized library kernels
Blocking (tiling) optimization can lead to better performance
43
Future Directions
Synergistic execution of stream programs on CPU and GPU
Support for multiple heterogeneous functional units
Retargeting PLASMA for multiple accelerators
Extending the framework beyond stream programming models
44
Thank You!!