SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY OCELOT: SUPPORTED DEVICES.




1 Ocelot: Supported Devices

2 Overview
- Ocelot PTX Emulator
- Multicore Backend
- NVIDIA GPU Backend
- AMD GPU Backend

3 Overview (section: Multicore Backend)
- Ocelot PTX Emulator
- Multicore Backend
- NVIDIA GPU Backend
- AMD GPU Backend

4 Multicore CPU Backend: Introduction
Target: efficient execution of PTX kernels on CPUs
- ISA translation from PTX to LLVM
- Execution-model translation from the PTX thread hierarchy to serialized PTX threads
- Lightweight thread scheduler
- LLVM just-in-time compilation to x86
- LLVM transformations applied before code generation

5 Some Interesting Features
- Serialization transforms
- JIT compilation for parallel code
- Utilization of all CPU resources

6 Translation to CPUs: Thread Fusion
Execution Manager: thread scheduling and context management; thread blocks are mapped onto multicore host threads via thread serialization.
Execution-model translation involves:
- Thread scheduling
- Dealing with specialized operations (e.g., custom hardware)
- Control-flow restructuring
- Resource management (multiple cores)
- Multiple address spaces
One worker pthread per CPU core executes a kernel.
J. Stratton, S. Stone, and W.-m. Hwu, "MCUDA: An Efficient Implementation of CUDA Kernels on Multi-cores," University of Illinois at Urbana-Champaign, Tech. Rep. IMPACT-08-01, March 2008.
G. Diamos, A. Kerr, S. Yalamanchili, and N. Clark, "Ocelot: A Dynamic Optimizing Compiler for Bulk-Synchronous Applications in Heterogeneous Systems," PACT, October 2010.

7 Ocelot Source Code: Multicore CPU Backend
ocelot/
  executive/
    interface/MulticoreCPUDevice.h
    interface/LLVMContext.h
    interface/LLVMExecutableKernel.h
    interface/LLVMCooperativeThreadArray.h
    interface/LLVMModuleManager.h
    interface/TextureOperations.h
  ir/
    interface/LLVMInstruction.h
  translator/
    interface/PTXToLLVMTranslator.h
  transforms/
    interface/SubkernelFormationPass.h
    interface/RemoveBarrierPass.h

8 Multicore CPU: ISA Translation
Translate the PTX IR to LLVM's internal representation:
- Arithmetic instructions have a one-to-few mapping (both are load-store architectures)
- Special instructions and registers are handled by LLVM intrinsics (e.g., cos, clock64, bar.sync)
- Texture sampling calls Ocelot's texture library
- The LLVMContext contains pointers to the address spaces, the next entry ID, and the thread ID
- A custom LLVM IR implementation insulates Ocelot from LLVM changes
- LLVM requires SSA form, so Ocelot converts PTX to SSA and removes predication

9 PTX to LLVM ISA Translation
Translate each PTX instruction to an LLVM IR instruction sequence. Special PTX registers and instructions are mapped to LLVM intrinsics such as llvm.readcyclecounter() and llvm.sqrt.f32(). The result is an LLVM function implementing the PTX kernel. The translation should be invertible if coupled to an LLVM-to-PTX code generator (not implemented).

// ocelot/translation/implementation/PTXToLLVMTranslator.cpp
void PTXToLLVMTranslator::_translateAdd( const ir::PTXInstruction& i )
{
    if( ir::PTXOperand::isFloat( i.type ) )
    {
        ir::LLVMFadd add;
        ir::LLVMInstruction::Operand result = _destination( i );
        add.a = _translate( i.a );
        add.b = _translate( i.b );
        add.d = result;
        _llvmKernel->_statements.push_back( ir::LLVMStatement( add ) );
    }
    else { /* ... */ }
}

10 Thread Serialization
- Thread loops: enter the next executable region via a scheduler block
- Barriers: store live values into thread-local memory, then return to the thread scheduler

11 Execution Management
- Translation takes place over (sub)kernels
- A code cache holds translated kernels
- Thread scheduling (serialization) code must be synthesized
[Diagram: thread serialization]

12 Spilling Live Values

// ocelot/analysis/implementation/RemoveBarrierPass.cpp
void RemoveBarrierPass::_addSpillCode( DataflowGraph::iterator block,
    const DataflowGraph::Block::RegisterSet& alive )
{
    unsigned int bytes = 0;
    ir::PTXInstruction move( ir::PTXInstruction::Mov );
    move.type = ir::PTXOperand::u64;
    move.a.identifier = "__ocelot_remove_barrier_pass_stack";
    move.a.addressMode = ir::PTXOperand::Address;
    move.a.type = ir::PTXOperand::u64;
    move.d.reg = _kernel->dfg()->newRegister();
    move.d.addressMode = ir::PTXOperand::Register;
    move.d.type = ir::PTXOperand::u64;
    _kernel->dfg()->insert( block, move, block->instructions().size() - 1 );
    ...

13 Spilling Live Values (continued)

    ...
    for( DataflowGraph::Block::RegisterSet::const_iterator
        reg = alive.begin(); reg != alive.end(); ++reg )
    {
        ir::PTXInstruction save( ir::PTXInstruction::St );
        save.type = reg->type;
        save.addressSpace = ir::PTXInstruction::Local;
        save.d.addressMode = ir::PTXOperand::Indirect;
        save.d.reg = move.d.reg;
        save.d.type = ir::PTXOperand::u64;
        save.d.offset = bytes;
        bytes += ir::PTXOperand::bytes( save.type );
        save.a.addressMode = ir::PTXOperand::Register;
        save.a.type = reg->type;
        save.a.reg = reg->id;
        _kernel->dfg()->insert( block, save, block->instructions().size() - 1 );
    }
    _spillBytes = std::max( bytes, _spillBytes );
}

14 Using the Multicore Backend
Edit configure.ocelot:
- Controls Ocelot's initial state
- Located in the application's startup directory
- The executive section controls device properties

trace: trace generators may be active for devices other than the PTX emulator. Only initialize() and finish() are called; event() and postEvent() are never called. This enables a uniform interface for profiling kernel launches.

executive:
- devices: llvm – efficient execution of PTX on a multicore CPU
- optimizationLevel – basic, none, full, memory, or debug
- workerThreadLimit – number of worker threads

optimizations:
- subkernelSize – size of subkernels in instructions
- simplifyCFG – whether to apply the CFG simplification pass
- hoistSpecialValues – whether to load LLVMContext values at kernel launch

Example:
executive: {
    devices: [ llvm ],
    asynchronousKernelLaunch: true,
    optimizationLevel: none,
    workerThreadLimit: 1,
    warpSize: 1
},
optimizations: {
    subkernelSize: 1000,
    simplifyCFG: true,
    hoistSpecialValues: true
},

15 Overview (section: NVIDIA GPU Backend)
- Ocelot PTX Emulator
- Multicore Backend
- NVIDIA GPU Backend
- AMD GPU Backend

16 NVIDIA GPU: Introduction
- Executes PTX kernels on GPUs via the CUDA Driver API
- Thin layer on top of the CUDA Driver API
- Ocelot enables rewriting of PTX kernels: register reallocation, runtime optimizations, instrumentation

17 Ocelot Source Code: NVIDIA GPU Device Backend
ocelot/
  executive/
    interface/NVIDIAGPUDevice.h
    interface/NVIDIAExecutableKernel.h

18 Using the NVIDIA GPU Backend
Edit configure.ocelot (controls Ocelot's initial state; located in the application's startup directory; the executive section controls device properties).

trace: trace generators may be active for devices other than the PTX emulator. Only initialize() and finish() are called; event() and postEvent() are never called. This enables a uniform interface for profiling kernel launches.

executive:
- devices: nvidia – invokes the NVIDIA GPU backend

Example:
executive: {
    devices: [ nvidia ],
},

19 Dynamic Instrumentation
- Run-time generation of user-defined, custom instrumentation code for CUDA kernels
- Harnesses chip-level instrumentation when possible
- Instrumentation data drives:
  - Off-line workload characterization
  - On-line debugging and program optimization
  - On-line resource management
Inspired in part by the Pin [1] infrastructure.
[1] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood, "Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation," PLDI '05.
PhD student: Naila Farooqui, joint with K. Schwan and A. Gavrilovska.

20 Instrumentation Support in Ocelot
- High-level C constructs to define instrumentation, plus a C-to-PTX JIT
- Integration with system-management software and the dynamic compiler: online resource management based on profiling
- Additional Instrumentor APIs to provide criteria for instrumentation: selectively instrument kernels

21 Custom Instrumentation
Transparent profiling and characterization of library implementations.
[Diagram: Lynx instrumentation flow – nvcc, PTX, Ocelot Run Time, CUDA Libraries, Instrumentation APIs, Instrumentor, C-on-Demand JIT, C-to-PTX Translator, PTX-to-PTX Transformer, Example Instrumentation Code]

22 Instrumentation: Instruction Count
[Chart: dynamic instruction counts for Scan (CUDA SDK)]

23 Remote Device Layer
- Remote procedure call layer for Ocelot device calls
- Execute local applications that run kernels remotely
- Multi-GPU applications can become multi-node

24 Switchable Compute
Switch devices at runtime for:
- Load balancing
- Instrumentation
- Fault-and-emulate
- Remote execution

25 Overview (section: AMD GPU Backend)
- Ocelot PTX Emulator
- Multicore Backend
- NVIDIA GPU Backend
- AMD GPU Backend
R. Dominguez, D. Schaa, and D. Kaeli, "Caracal: Dynamic Translation of Runtime Environments for GPUs," GPGPU-4, 2011.

26 AMD GPU Backend
- Executes PTX kernels on GPUs via the CAL Driver API
- Rewriting of PTX kernels (for optimization, instrumentation, etc.) also carries over to the AMD backend
Ocelot Device Interface:
- Module registration
- Memory management: global/shared/constant/parameter memory allocation
- Kernel launches: translation from PTX to IL
- Texture management
- OpenGL interoperability
- Streams and events
R. Dominguez, D. Schaa, and D. Kaeli, "Caracal: Dynamic Translation of Runtime Environments for GPUs," GPGPU-4, 2011.

27 AMD Evergreen Architecture
AMD Radeon HD 5870:
- 20 SIMD cores
- 16 stream cores (SCs) per SIMD core
- Each SC is VLIW-5
- 1600 ALUs in total
- Wavefronts of 64 threads
- Peak of 2.72 TFLOPS (single precision) and 544 GFLOPS (double precision)

28 AMD Evergreen Architecture
[Diagram: one SIMD engine with general-purpose registers, stream cores, processing elements, a T-processing element, a branch execution unit, and instruction/control flow. Source: AMD OpenCL University Kit]
Each stream core includes:
- 4 processing elements: 4 independent SP or integer operations, 2 DP operations, or 1 DP fma or multiply
- 1 special function unit: 1 SP or integer operation, or an SP or DP transcendental
- A branch execution unit
Total GPRs: 5.24 MB

29 AMD Evergreen Architecture
- Local Data Share: 32 KB per SIMD, 2 TB/s
- Global Data Share: shared between all threads in a kernel; low-latency global reductions
- L1 cache: 8 KB
- L2 cache: 512 KB, 450 GB/s
- Global memory: GDDR5, 153 GB/s; CompletePath and FastPath

30 Memory Hierarchy
[Diagram: SIMD engines with local memory and registers connect through L1 caches and a crossbar to the L2 cache, write cache, and atomic path, backed by global memory.]

31 Memory Hierarchy
- Benefits from vector operations (int4, float4)
- Atomics are faster in local memory than in global memory (FastPath vs. CompletePath)
- Images typically use a tiled data layout to hit the L1 cache
- The compiler optimizes by minimizing ALU code, maximizing the number of threads, and scheduling instructions to increase VLIW packing

32 Address Spaces
Unordered Access Views (raw):
- 8 different UAVs
- Byte-addressable (linear)
- Dword (4-byte) alignment
- Arena UAV for sub-dword data
Constant buffers:
- Non-linear addressing (x, y, z, w components)
Local Data Share:
- Byte-addressable (linear)
- Dword-aligned and dword-sized (pack/unpack overhead)

33 Translation from PTX to IL
PTX:
- RISC-style syntax
- Load-store instruction set
- Registers are typed and scalar
- Unlimited virtual registers
- Predicate registers
- Control flow based on branches and labels
- Designed for compute (GPGPU)

.entry vecAdd (.param.u64 A, .param.u64 B, .param.u64 C, .param.s32 N)
{
    mov.u16      rh1, ctaid.x;
    mov.u16      rh2, ntid.x;
    mul.wide.u16 r1, rh1, rh2;
    cvt.u32.u16  r2, tid.x;
    add.u32      r3, r2, r1;
    ld.param.s32 r4, [N];
    setp.le.s32  p1, r4, r3;
@p1 bra          Label_1;
    ...
}

34 Translation from PTX to IL
IL:
- Registers are 32-bit vectors (4 components)
- Registers have no type
- Swizzles and destination modifiers
- Resources are globally scoped
- Structured control flow (if-end, while-end)
- Designed for graphics, not compute (see FSAIL)

il_cs_2_0
dcl_raw_uav_id(0)
dcl_cb cb0[2]
dcl_cb cb1[4]
dcl_literal l0, 4, 4, 4, 4
mov  r0.x, vThreadGrpId.x
mov  r1.x, cb0[0].x
imul r2.x, r0.x, r1.x
mov  r3.x, vTidInGrp.x
iadd r4.x, r3.x, r2.x
mov  r5.x, cb1[3].x
ige  r6.x, r4.x, r5.x
if_logicalz r6.x
    ...
endif
end

35 AMD GPU Backend
- Validated over 30 applications from the CUDA SDK
- Support for pre-compiled libraries
- Device selection can be made at runtime
What is supported?
- Global memory (cudaMalloc, cudaMemcpy)
- Shared memory (including extern)
- Constant memory (no caching)
- Atomics (global and shared)
- Barriers and fences
- 30+ PTX instructions
R. Dominguez, D. Schaa, and D. Kaeli, "Caracal: Dynamic Translation of Runtime Environments for GPUs," GPGPU-4, 2011.

36 Ocelot Source Code: AMD GPU Device Backend
ocelot/
  analysis/
    interface/StructuralAnalysis.h
  executive/
    interface/ATIGPUDevice.h
    interface/ATIExecutableKernel.h
  transforms/
    interface/StructuralTransform.h

37 Using the AMD GPU Backend
Edit configure.ocelot (controls Ocelot's initial state; located in the application's startup directory; the executive section controls device properties).

trace: trace generators may be active for devices other than the PTX emulator. Only initialize() and finish() are called; event() and postEvent() are never called. This enables a uniform interface for profiling kernel launches.

executive:
- devices: amd – invokes the AMD GPU backend

Example:
executive: {
    devices: [ amd ],
},

38 Unstructured to Structured Control Flow*
- Branch divergence is key to high performance on GPUs
- Its impact differs depending on whether the control flow is structured or unstructured
- Not all GPUs support unstructured control flow directly
- Dynamic translation is used to support AMD GPUs**
* H. Wu, G. Diamos, S. Li, and S. Yalamanchili, "Characterization and Transformation of Unstructured Control Flow in GPU Applications," CACHES, 2011.
** R. Dominguez, D. Schaa, and D. Kaeli, "Caracal: Dynamic Translation of Runtime Environments for GPUs," GPGPU-4, pages 5-11, ACM, 2011.

39 Structured/Unstructured Control Flow
- Structured control flow has a single entry and a single exit (if-then-else, for/while loops, do-while loops)
- Unstructured control flow has multiple entries or exits
[Diagram: entry/exit graphs for if-then-else, for-loop/while-loop, and do-while-loop]

40 Sources of Unstructured Control Flow (1/2)
- The goto statement of C/C++
- Language semantics: not all conditions need to be evaluated

if ((cond1() || cond2()) && (cond3() || cond4())) {
    ......
}

[Diagram: CFG with blocks B1 (bra cond1()), B2 (bra cond2()), B3 (bra cond3()), B4 (bra cond4()), and B5 between entry and exit; the sub-graphs in red circles have 2 exits.]

41 Re-convergence in AMD & Intel GPUs
- AMD IL does not support arbitrary branches; it instead uses structured constructs such as IF, ELSE, LOOP, and ENDLOOP
- Intel GEN5 works in a similar manner

C code:
if (i < N) {
    C[i] = A[i] + B[i];
}

AMD IL:
ige r6, r4, r5
if_logicalz r6
    uav_raw_load_id(0) r11, r10
    uav_raw_load_id(0) r14, r13
    iadd r17, r16, r8
    uav_raw_store_id(0) r17, r15
endif

42 Re-converge at the Immediate Post-Dominator
[Diagram: the CFG from the previous slide (B1 bra cond1(), B2 bra cond2(), B3 bra cond3(), B4 bra cond4(), B5, between entry and exit) alongside two execution timelines for threads T0-T6, showing diverged threads executing B2, B3, and B4 and re-joining at each branch's immediate post-dominator, ultimately at B5.]

