SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Ocelot: An Open Source Debugging and Compilation Framework for CUDA Gregory.

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Ocelot: An Open Source Debugging and Compilation Framework for CUDA Gregory Diamos*, Andrew Kerr*, Sudhakar Yalamanchili Computer Architecture and Systems Laboratory School of Electrical and Computer Engineering Georgia Institute of Technology Special thanks to our sponsors: IBM, Intel, LogicBlox, NSF, NVIDIA

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Outline Ocelot Vision and Goals Multiplatform Dynamic Compilation for CUDA Performance Modeling CUDA Productivity Tools

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Productivity Challenges of Heterogeneity Execution Portability – Systems evolve over time – New systems Performance – Debugging and optimization  Application Migration – Protect investments in existing code bases esd.lbl.gov math.harvard.edu Sandia.gov Harmony Ocelot/LLVMNVIDIA JIT OS/VM Language Front-End Emerging Software Stacks Productivity Tools

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Ocelot Vision Just-in-time code generation and optimization for CUDA applications Basis for a range of productivity and workload characterization tools esd.lbl.gov

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Project Goals Encourage proliferation of GPU computing Lower the barriers to entry for researchers and developers Understand performance behavior of CUDA applications on multiple platforms Translate, optimize, debug, and execute CUDA applications on multiple platforms

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Ocelot: Dynamic Execution Infrastructure 7 Dynamic Translation (LLVM)PTX Emulator Performance Analysis and Modeling Productivity ToolsMultiplatform Support Multi-GPU Support PTX Kernel

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Ocelot PTX Intermediate Representation (IR) Backend compiler framework for PTX Full-featured PTX IR Class hierarchy for PTX instructions/directives PTX control flow graph Static single-assignment form Dataflow/dominance analysis Enables PTX optimization IR to IR translation From PTX to other IRs LLVM (x86/PowerPC/ARM) CAL (AMD GPUs) PTX Kernel

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Example Usage Scenarios Domain specific IR PTX IR Interface Switchable Compute Run- Time Flexible Event Interface Programming Languages Compiler Optimizations Operating Systems & Run-Times Computer Architecture Research Optimization of Database Applications Thread fusion, control flow optimization, etc. Kernel scheduling and data movement optimizations Instruction and/or memory trace generation Research CommunitiesOcelot InterfaceDriving Research Workload Characterization/Tuning Flexible Event Interface Control flow behavior, data sharing patterns

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY NVIDIA’s Parallel Thread eXecution ISA PTX defines an execution model that abstracts cores as CTAs and hardware threads/SIMD units as threads

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Ocelot PTX Emulator Performs functional simulation of the PTX execution model Enables detailed performance evaluation/debugging Program correctness checking Workload characterization and modeling Architecture simulation PTX 1.4 support Fermi support in progress Abstract machine model

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY CUDA on CPUs Execution model translation Ocelot fuses CUDA threads into heavy-weight pthreads

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Switchable Compute Switch devices at runtime Load balancing Instrumentation Fault-and-emulate Remote execution

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Statistical Modeling Prior work has used Ocelot for statistical workload/architecture modeling Application/machine metrics collected via static analysis, instrumentation, and emulation Principal Component Analysis (PCA) combines related metrics Cluster analysis identifies related sets of applications Regression models use metrics to predict application behavior

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Performance Predictions New application on same GPU Left 12 applications train right 12 GTX280 Same applications on new GPU Trained on: 8600 GS, GTX280, C1060 Test: GeForce 8800 GTX

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Outline Ocelot Vision and Goals Multiplatform Dynamic Compilation for CUDA Performance Modeling Productivity Tools Program correctness Performance Workload characterization

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY PTX Emulator – CUDA Debugging – Race Detection // file: raceCondition.cu __global__ void raceCondition(int *A) { __shared__ int SharedMem[64]; SharedMem[threadIdx.x] = A[threadIdx.x]; // no synchronization barrier! A[threadIdx.x] = SharedMem[64 - threadIdx.x]; // line 9 - faulting load }... raceCondition >>( validPtr );... ==Ocelot== Ocelot PTX Emulator failed to run kernel "_Z13raceConditionPi" with exception: ==Ocelot== [PC 15] [thread 0] [cta 0] ld.shared.s32 %r14, [%r13 + 252] - Shared memory race condition, address 0xfc was previously written by thread 63 without a memory barrier in between. ==Ocelot== Near raceCondition.cu:9:0

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY PTX Emulator – Performance Tuning.... for (int offset = 1; offset < n; offset *= 2) {// line 61 pout = 1 - pout; pin = 1 - pout; __syncthreads(); temp[pout*n+thid] = temp[pin*n+thid]; if (thid >= offset) { temp[pout*n+thid] += temp[pin*n+thid - offset]; }.... Identify critical regions Memory demand Floating-point intensity Shared memory bank conflicts hot cold

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Questions? Download Ocelot: http://code.google.com/p/gpuocelot/ (New BSD License) Contact us: gregory.diamos@gatech.edugregory.diamos@gatech.edu, arkerr@gatech.edu, sudha@ece.gatech.eduarkerr@gatech.edu sudha@ece.gatech.edu Mailing list: gpuocelot@googlegroups.com

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Additional Material

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Example PTX  PTX Optimization Ocelot includes a comprehensive optimization framework for PTX Implementation modeled after LLVM passes Optimizations can be applied dynamically between kernel executions

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY PTX IR – CPU Backend – Memory Traversal 2 This pattern significantly affects throughput-bound applications

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Ocelot CUDA Runtime A complete reimplementation of the CUDA Runtime API Clean device abstraction All back-ends implement same interface Compatible with existing applications Link against libocelot.so instead of libcudart Ocelot API Extensions Add/remove trace generators Compile/launch kernels directly in PTX Device memory sharing among host threads Device switching

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Outline Ocelot Vision and Goals Multiplatform Dynamic Compilation for CUDA Full Function Instruction Set Emulation Productivity Tools Correctness and debugging functionality Performance Analysis and Workload Characterization

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY PTX Emulator – CUDA Debugging- Erroneous Memory Accesses // file: memoryCheck.cu __global__ void badMemoryReference(int *A) { A[threadIdx.x] = 0; // line 3 - faulting store } int main() { int *invalidPtr = 0x0234; // arbitrary pointer does not refer // to an existing memory allocation int *validPtr = 0; cudaMalloc((void **)&validPtr, sizeof(int)*64); badMemoryReference >>( invalidPtr ); return 0; } ==Ocelot== Ocelot PTX Emulator failed to run kernel "_Z18badMemoryReferencePi" with exception: ==Ocelot== [PC 5] [thread 0] [cta 0] st.global.s32 [%r4 + 0], %r0 - Global memory access 0x234 is not within any allocated or mapped range. ==Ocelot== ==Ocelot== Nearby Device Allocations ==Ocelot== [0x12fa2e0] - [0x12fa3e0] (256 bytes) ==Ocelot== ==Ocelot== Near memoryCheck.cu:3:0

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY PTX Emulator – Interactive Debugger Interactive command-line debugger implemented using TraceGenerator interface $./TestCudaSequence A_gpu = 0x16dcbe0 (ocelot-dbg) Attaching debugger to kernel 'sequence' (ocelot-dbg) (ocelot-dbg) watch global address 0x16dcbe4 s32[3] set #1: watch global address 0x16dcbe4 s32[3] - 12 bytes (ocelot-dbg) (ocelot-dbg) continue st.global.s32 [%r11 + 0], %r7 watchpoint #1 - CTA (0, 0) thread (1, 0, 0) - store to 0x16dcbe4 4 bytes old value = -1 new value = 2 thread (2, 0, 0) - store to 0x16dcbe8 4 bytes old value = -1 new value = 4 thread (3, 0, 0) - store to 0x16dcbec 4 bytes old value = -1 new value = 6 break on watchpoint (ocelot-dbg) Enables Inspection of application state Single-stepping of instructions Breakpoints and watchpoints Faults in MemoryChecker and RaceDetector invoke ocelot- dbg automatically

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Trace Generator Interface Emulator broadcasts events to trace generators during execution Events describe instruction execution, memory addresses, etc. Used for error checking, instrumentation, and simulation

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Trace Generator Interface //! Base class for generating traces class TraceGenerator { public: TraceGenerator(); virtual ~TraceGenerator(); // called when a traced kernel is launched to // retrieve some parameters from the kernel virtual void initialize( const executive::ExecutableKernel& kernel); // Called whenever an event takes place. virtual void event(const TraceEvent & event); // called when an event is committed virtual void postEvent(const TraceEvent & event); // Called when a kernel is finished. There will // be no more events for this kernel. virtual void finish(); };

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY TraceEvent Object Captures execution of a dynamic instruction PC and internal representation of PTX instruction Set of active threads executing instruction Kernel grid and CTA dimensions Memory addresses and size of transfer Branch target(s)

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Example – Thread Load Imbalance //! Computes number of dynamic instructions for each thread class ThreadLoadImbalance: public trace::TraceGenerator { public: std::vector dynamicInstructions; // For each dynamic instruction, increment counters of each // thread that executes it virtual void event(const TraceEvent & event) { if (!dynamicInstrucitons.size()) dynamicInstructions.resize(event.active.size(), 0); for (int i = 0; i < event.active.size(); i++) { if (event.active[i]) dynamicInstructions[i]++; } }; Mandelbrot (CUDA SDK)

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Outline Ocelot Vision and Goals Multiplatform Dynamic Compilation for CUDA Full Function Instruction Set Emulation Productivity Tools Correctness and debugging functionality Performance Analysis and Workload Characterization

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Workload Characterization Describe applications using architecture-independent metrics Control flow, memory intensity/efficiency, data sharing, parallelism Many functions constructed using the trace generator interface

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Activity Factor Fraction of threads active for each dynamic instruction Utilization of CTA-wide SIMD processor Sensitive to reconvergence mechanism

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Interthread Data Flow Which kernels exchange computed results through shared memory? Track id of producer thread Ensure threads are well synchronized Optionally ignore uses of shared memory to transfer working sets

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY PTX Emulator – Statistical Modeling Data flow correlated to Kernels launched, live registers Kernels imply barriers Data sharing across CTAs Applications that share data do so at all levels of the thread hierarchy

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Architecture Simulation Ocelot PTX emulator has been used as an architecture simulator front-end Can explore impact of architectural changes eg. Instruction fetch and memory scheduling policies MacSim Architecture Simulator (H. Kim-GT) J. Lee, N. B. Lakshminarayana, H. Kim, R.Vuduc, “Many-Thread Aware Prefetching Mechanisms for GPGPU Applications,” MICRO 43, December 2010.

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Ongoing Work

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Ocelot: PTX Emulator CUDADatalogOpenCL Correctness & Performance Tools Harmony Runtime Instrumentation database Enterprise SW front- end from LogicBlox Inc. (in progress) Future PlansToday Ocelot Front Ends

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY AMD Backend Goal: Explore CUDA a cross-platform alternative to OpenCL CUDA as a generic programming model Understand the relationships between algorithms and architecture Explore execution and performance portability issues Architecture specific code generation issues, VLIW+SIMD vs. Superscalar+SIMD Unstructured control flow People: Rodrigo Dominguez (rdomingu@ece.neu.edu)

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Software Warp-Formation People: Andrew Kerr (arkerr@gatech.edu)

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Harmony Runtime: Multiple Devices People: Gregory Diamos (gregory.diamos@gatech.edu)

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Contributors Ocelot Team: Gregory Diamos, Rodrigo Dominguez, Naila Farooqui, Andrew Kerr, Si Li, Tri Pho, Haicheng Wu, Sudhakar Yalamanchili Special Thanks: Nathan Bell, Sylvan Collange, Jared Hoberook, David Luebke, Ryuta Suzuki, Steve Worley, Bruno Coutinho

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY PTX IR – CPU Backend – Memory Traversal PTX optimizations can change memory access patterns

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Ocelot: An Open Source Debugging and Compilation Framework for CUDA Gregory.

Similar presentations

Presentation on theme: "SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Ocelot: An Open Source Debugging and Compilation Framework for CUDA Gregory."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Ocelot: An Open Source Debugging and Compilation Framework for CUDA Gregory.

Similar presentations

Presentation on theme: "SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Ocelot: An Open Source Debugging and Compilation Framework for CUDA Gregory."— Presentation transcript:

Similar presentations

About project

Feedback