Operation of the Basic SM Pipeline

©Sudhakar Yalamanchili unless otherwise noted

Objectives
- Cycle-level examination of the operation of the major pipeline stages in a streaming multiprocessor (SM)
- A breadth-first look at a basic pipeline
- Understand the type of information necessary for each stage of operation
- Identification of performance bottlenecks
- Detailed implementations are addressed in subsequent modules


Reading
- Documentation for the GPGPU-Sim simulator: a good source of information about the general organization and operation of a streaming multiprocessor
- Operation of a scoreboard
- General-Purpose Graphics Processor Architectures, T. Aamodt, W. Fung, and T. Rogers, Chapter 2.2

NVIDIA GK110 (Kepler)
- Thread Block Scheduler
- Hierarchy of schedulers: kernel, thread block (TB), warp, memory transactions
Image from

SMX Organization: GK110
- Multiple warp schedulers
- 64K 32-bit registers
- 192 cores: 6 clusters of 32 cores each
- What are the main stages of a generic SMX pipeline?
Image from

A Generic SM Pipeline
[Figure: generic SM pipeline. Pending warps (Warp 1, Warp 2, Warp 6, ...) feed I-Fetch, Decode, I-Buffer, Issue, the register files (RF, PRF), the scalar pipelines, the D-Cache ("All Hit?" / "Miss?"), and Writeback.]
Front-end:
- Scalar Fetch & Decode
- Instruction Issue & Warp Scheduler
Back-end:
- Predicate & GP Register Files
- Scalar Cores (Scalar Pipelines)
- Data Memory Access
- Writeback/Commit

Single Warp Execution
[Figure: a warp taken from a thread block of the grid, with its warp state: PC, AM (active mask), WID (warp ID), State.]
PTX (Assembly):
    setp.lt.s32 %p, %r5, %rd4;      // r5 = index, rd4 = N
    @p bra L1;
    bra L2;
L1: ld.global.f32 %f1, [%r6];       // r6 = &a[index]
    ld.global.f32 %f2, [%r7];       // r7 = &b[index]
    add.f32 %f3, %f1, %f2;
    st.global.f32 [%r8], %f3;       // r8 = &c[index]
L2: ret;
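To make the warp-level view of the PTX listing concrete, here is a minimal Python sketch of one warp executing the guarded vector add. The warp size of 32, the function name `warp_vector_add`, and the list-based memory are assumptions for illustration, not part of the slides.

```python
WARP_SIZE = 32  # assumption: 32 threads per warp

def warp_vector_add(a, b, c, base, n):
    """Model one warp executing c[i] = a[i] + b[i], guarded by i < n.

    `base` is the index handled by lane 0 of the warp (hypothetical parameter).
    Returns the per-lane predicate, which acts as the active mask for L1.
    """
    # setp.lt.s32: every lane evaluates its own predicate (index < N)
    active_mask = [(base + lane) < n for lane in range(WARP_SIZE)]
    # Lanes with the predicate set execute the loads, add, and store at L1;
    # the remaining lanes branch to L2 and simply return.
    for lane in range(WARP_SIZE):
        if active_mask[lane]:
            i = base + lane
            c[i] = a[i] + b[i]
    return active_mask

n = 40
a = [float(i) for i in range(n)]
b = [2.0 * i for i in range(n)]
c = [0.0] * n
mask_w0 = warp_vector_add(a, b, c, 0, n)    # warp 0: all lanes active
mask_w1 = warp_vector_add(a, b, c, 32, n)   # warp 1: only lanes 0-7 active
```

With N = 40, the second warp runs with a partially set mask, which is the situation the per-warp state (PC, active mask) has to track.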

Instruction Fetch & Decode
[Figure: pipeline diagram (I-Fetch, Decode, I-Buffer, Issue, RF/PRF, scalar pipelines, D-Cache, Writeback) with the fetch stage highlighted. A table of per-warp PCs (Warp 0 PC, Warp 1 PC, ..., Warp n-1 PC) feeds the I-Cache; a "Next Warp" selector chooses which PC to fetch. Per-warp state: PC, AM, WID, State, Instr.]
- Examples from the Harmonica2 GPU
- May realize multiple fetch policies
From GPGPU-Sim Documentation
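One possible fetch policy is round-robin over the per-warp PC table. The sketch below is a simplified model under stated assumptions: the class `FetchUnit`, the `can_fetch` flags (for example, "this warp's I-buffer is empty"), and the 8-byte instruction size are all hypothetical and are not taken from GPGPU-Sim or Harmonica2.

```python
class FetchUnit:
    """Round-robin selection of the next warp to fetch for (illustrative only)."""

    def __init__(self, num_warps, instr_bytes=8):
        self.pc = [0] * num_warps        # one PC per warp
        self.last = num_warps - 1        # warp fetched most recently
        self.instr_bytes = instr_bytes   # assumed fixed instruction size

    def select(self, can_fetch):
        """Return the next eligible warp after `last`, or None if there is none."""
        n = len(self.pc)
        for step in range(1, n + 1):
            w = (self.last + step) % n
            if can_fetch[w]:
                self.last = w
                return w
        return None

    def fetch(self, can_fetch):
        """Return (warp ID, address sent to the I-cache) and advance that warp's PC."""
        w = self.select(can_fetch)
        if w is None:
            return None
        addr = self.pc[w]
        self.pc[w] += self.instr_bytes
        return w, addr

fu = FetchUnit(num_warps=4)
eligible = [True, False, True, True]     # warp 1 cannot accept a fetch
trace = [fu.fetch(eligible) for _ in range(4)]
```

Other policies (for example, oldest-first) would only change `select`; the PC table and the interface to the I-cache stay the same.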

Instruction Buffer
[Figure: pipeline diagram with the I-buffer highlighted. Example: buffer 2 instructions per warp. Each entry holds a decoded instruction with V and R bits (Instr 1 W1, Instr 2 W1, Instr 1 W2, ..., Instr 2 Wn) and is connected to the scoreboard.]
- Buffer a fixed number of instructions per warp
- Coordinated with instruction fetch: a fetch for a warp needs an empty I-buffer for that warp
- V: valid instruction in the buffer
- R: instruction ready to be issued; set using the scoreboard logic
From GPGPU-Sim Documentation
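The V and R bits can be sketched as follows. This is a minimal model assuming two slots per warp and in-order issue within a warp; the names `Slot`, `IBuffer`, `fill`, and `issue` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Slot:
    valid: bool = False    # V: slot holds a decoded instruction
    ready: bool = False    # R: scoreboard says the instruction may issue
    instr: object = None

class IBuffer:
    def __init__(self, num_warps, slots_per_warp=2):
        self.slots = [[Slot() for _ in range(slots_per_warp)]
                      for _ in range(num_warps)]

    def is_empty(self, w):
        """Fetch for warp w is only allowed when this returns True."""
        return not any(s.valid for s in self.slots[w])

    def fill(self, w, instrs):
        """Store freshly decoded instructions; refuse if the warp's buffer is not empty."""
        if not self.is_empty(w):
            return False
        for slot, ins in zip(self.slots[w], instrs):
            slot.valid, slot.ready, slot.instr = True, False, ins
        return True

    def issue(self, w):
        """Remove and return the oldest valid instruction of warp w if it is ready."""
        for slot in self.slots[w]:
            if slot.valid:
                if not slot.ready:
                    return None          # oldest instruction still blocked
                instr = slot.instr
                slot.valid, slot.ready, slot.instr = False, False, None
                return instr
        return None

buf = IBuffer(num_warps=2)
filled = buf.fill(0, ["ld.global.f32", "add.f32"])
refilled = buf.fill(0, ["st.global.f32"])   # rejected: buffer not empty
blocked = buf.issue(0)                      # R bit not yet set
buf.slots[0][0].ready = True                # scoreboard marks the load ready
first = buf.issue(0)
```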

Instruction Buffer (2)
[Figure: pipeline diagram with the I-buffer entries (V and R bits; Instr 1 and Instr 2 for warps W1 ... Wn) connected to the scoreboard.]
- The scoreboard detects WAW and RAW hazards and blocks issue until they clear
- Indexed by warp ID
- Each entry holds the registers that are reserved for that warp
- Destination registers are reserved at issue
- Reserved registers are released at writeback
- Enables multiple instructions from a single warp to be in execution
From GPGPU-Sim Documentation
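The reserve-at-issue, release-at-writeback behavior can be sketched as a set of pending destination registers per warp. The class and method names below are hypothetical, and register names are plain strings for readability.

```python
class Scoreboard:
    """Per-warp set of reserved destination registers (illustrative sketch)."""

    def __init__(self, num_warps):
        self.pending = [set() for _ in range(num_warps)]   # indexed by warp ID

    def has_hazard(self, w, srcs, dsts):
        # RAW: a source is still being produced; WAW: a destination is still pending
        return bool((set(srcs) | set(dsts)) & self.pending[w])

    def reserve(self, w, dsts):
        """Called at issue."""
        self.pending[w].update(dsts)

    def release(self, w, dsts):
        """Called at writeback."""
        self.pending[w].difference_update(dsts)

sb = Scoreboard(num_warps=1)
# ld.global.f32 %f1, [%r6]
sb.reserve(0, ["f1"])
# ld.global.f32 %f2, [%r7]: independent, so a second instruction of the same warp issues
second_load_blocked = sb.has_hazard(0, srcs=["r7"], dsts=["f2"])
sb.reserve(0, ["f2"])
# add.f32 %f3, %f1, %f2: RAW on f1 and f2 until both loads write back
add_blocked = sb.has_hazard(0, srcs=["f1", "f2"], dsts=["f3"])
sb.release(0, ["f1"])
add_still_blocked = sb.has_hazard(0, srcs=["f1", "f2"], dsts=["f3"])
sb.release(0, ["f2"])
add_clear = not sb.has_hazard(0, srcs=["f1", "f2"], dsts=["f3"])
```

The two loads being in flight at the same time is the "multiple instructions in execution from a single warp" case.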

Instruction Buffer (3)
[Figure: pipeline diagram with the I-buffer entries (V and R bits; Instr 1 and Instr 2 for warps W1 ... Wn) connected to the scoreboard.]
Generic scoreboard, functional-unit status fields:
- Name, Busy, Op
- Fi: destination register
- Fj, Fk: source registers (src1, src2)
- Qj, Qk: functional units producing the source values
- Rj, Rk: do the source registers have their values?
- Example entry from the slide (empty cells omitted): Int, Yes, Load, F2, R3, No
Next: a modified scoreboard design to address
- Having multiple instructions in transit
- Excessive demand for register file ports
From GPGPU-Sim Documentation

Instruction Issue
[Figure: pipeline diagram with the issue stage highlighted. The warp scheduler selects an instruction from a pool of ready warps (Warp 3, Warp 7, Warp 8).]
- The warp scheduler manages the implementation of barriers, register dependencies, and control divergence
From GPGPU-Sim Documentation

Instruction Issue (2)
[Figure: pipeline diagram with the issue stage highlighted. Warps (Warp 3, Warp 7, Warp 8) are held at a barrier in front of the warp scheduler.]
- Barriers: warps wait here for barrier synchronization
- All threads in the thread block must reach the barrier
From GPGPU-Sim Documentation
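A barrier can be modeled as a set of arrived warps that is released once every warp of the thread block has arrived. This is a simplified sketch: it assumes all warps of the block participate (no warp has exited early), and `BarrierUnit` and its methods are hypothetical names.

```python
class BarrierUnit:
    """Tracks which warps of one thread block are waiting at a barrier."""

    def __init__(self, warps_in_block):
        self.expected = set(warps_in_block)
        self.arrived = set()

    def arrive(self, w):
        """Warp w reaches the barrier. Returns the warps released (empty if still waiting)."""
        self.arrived.add(w)
        if self.arrived == self.expected:
            released, self.arrived = self.arrived, set()
            return released
        return set()

    def is_waiting(self, w):
        """The scheduler must not issue from a warp for which this is True."""
        return w in self.arrived

bar = BarrierUnit(warps_in_block=[3, 7, 8])
r1 = bar.arrive(3)
waiting_after_first = bar.is_waiting(3)
r2 = bar.arrive(8)
r3 = bar.arrive(7)      # last warp arrives: all three are released
```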

Instruction Issue (3)
[Figure: pipeline diagram with the issue stage highlighted. The warp scheduler (Warp 3, Warp 7, Warp 8) consults the scoreboard, which is tied to the I-buffer entries (V and R bits; Instr 1 and Instr 2 for warps W1 ... Wn).]
- Register dependencies: tracked through the scoreboard
From GPGPU-Sim Documentation

Instruction Issue (4)
[Figure: pipeline diagram with the issue stage highlighted. Divergent warps (Warp 3, Warp 7, Warp 8) pass through a per-warp SIMT stack in the warp scheduler.]
- Control divergence: handled with a per-warp SIMT stack
- The stack keeps track of divergent threads at a branch
- It creates the execution mask that is read with the operands
From GPGPU-Sim Documentation
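The slide does not spell out the stack layout. The sketch below assumes the common textbook scheme in which each entry holds a reconvergence PC, a next PC, and an active mask; the class `SIMTStack`, its methods, and the PC values are hypothetical. Masks are Python integers with one bit per thread.

```python
class SIMTStack:
    """Per-warp reconvergence stack (assumed layout: [reconv PC, next PC, mask])."""

    def __init__(self, start_pc, full_mask):
        self.stack = [[None, start_pc, full_mask]]

    def top(self):
        """PC and execution mask the warp should use for its next instruction."""
        _, pc, mask = self.stack[-1]
        return pc, mask

    def advance(self, next_pc):
        """Move to next_pc; pop the entry when it reaches its reconvergence point."""
        entry = self.stack[-1]
        if next_pc == entry[0]:
            self.stack.pop()
        else:
            entry[1] = next_pc

    def branch(self, taken_mask, taken_pc, fallthrough_pc, reconv_pc):
        """Handle a branch; push one entry per path if the warp diverges."""
        _, _, mask = self.stack[-1]
        taken = mask & taken_mask
        not_taken = mask & ~taken_mask
        if taken == 0 or not_taken == 0:          # all active threads agree
            self.advance(taken_pc if taken else fallthrough_pc)
            return
        self.stack[-1][1] = reconv_pc             # resume here after both paths
        self.stack.append([reconv_pc, fallthrough_pc, not_taken])
        self.stack.append([reconv_pc, taken_pc, taken])

s = SIMTStack(start_pc=10, full_mask=0b1111)      # 4-thread warp for brevity
s.branch(taken_mask=0b0011, taken_pc=20, fallthrough_pc=11, reconv_pc=30)
taken_path = s.top()
s.advance(30)                                     # taken path reaches reconvergence
fallthrough_path = s.top()
s.advance(30)                                     # fall-through path reaches it too
reconverged = s.top()
```

In this model the two paths execute one after the other, each with a partial mask, and the full mask is restored at the reconvergence point.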

Instruction Issue (5)
- The scheduler can issue multiple instructions from a warp
- Issue conditions:
  - Has valid instructions
  - Not waiting at a barrier
  - Scoreboard check passes
  - Pipeline is not stalled at the operand access stage (covered later)
- Reserve destination registers
- Instructions may issue to the memory, SP, or SFU pipelines
- Warp scheduling disciplines: more later in the course
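The issue conditions listed above can be combined into a single check. This is a minimal sketch: the `Instr` record, the `try_issue` function, and the pipeline names as strings are hypothetical, and the operand-stage stall is reduced to a boolean input.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    op: str
    dsts: tuple
    srcs: tuple
    unit: str          # target pipeline: "MEM", "SP", or "SFU"

def try_issue(instr, valid, at_barrier, reserved, operand_stage_stalled):
    """Return the target pipeline if `instr` can issue this cycle, else None.

    `reserved` is the warp's set of pending destination registers (scoreboard entry).
    """
    if not valid:                                    # has a valid instruction
        return None
    if at_barrier:                                   # not waiting at a barrier
        return None
    if reserved & (set(instr.srcs) | set(instr.dsts)):
        return None                                  # scoreboard check failed
    if operand_stage_stalled:                        # operand access stage is stalled
        return None
    reserved.update(instr.dsts)                      # reserve destination registers
    return instr.unit

reserved = set()
ld = Instr("ld.global.f32", dsts=("f1",), srcs=("r6",), unit="MEM")
add = Instr("add.f32", dsts=("f3",), srcs=("f1", "f2"), unit="SP")
ld_target = try_issue(ld, True, False, reserved, False)
add_target = try_issue(add, True, False, reserved, False)   # blocked on f1
```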

Register File Access
[Figure: pipeline diagram with the register file stage highlighted. An arbiter controls access to single-ported register file banks (banks 0-15, each holding RF0, RF1, ..., RF n-1). A 1024-bit crossbar (Xbar) connects the banks, together with results returning from the SPs, to the operand collectors (OC). Dispatch units (DU) send collected instructions to the ALUs, L/S, and SFU.]
- Single-ported register file banks
- Operand collectors (OC)
- Dispatch units (DU)
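Because each bank is single ported, two operand reads that map to the same bank must be serialized by the arbiter. The sketch below counts the cycles needed to serve a set of reads; the register-to-bank mapping in `bank_of` (interleaved and skewed by warp ID) is an assumption, as are the function names.

```python
NUM_BANKS = 16   # banks 0-15, as in the figure

def bank_of(warp_id, reg):
    # Assumed mapping: registers interleaved across banks, skewed by warp ID
    return (reg + warp_id) % NUM_BANKS

def collect_operands(requests):
    """Cycles needed to grant all operand reads, at one read per bank per cycle.

    `requests` is a list of (warp_id, register number) pairs waiting in the
    operand collectors.
    """
    pending = list(requests)
    cycles = 0
    while pending:
        busy = set()
        remaining = []
        for warp_id, reg in pending:
            bank = bank_of(warp_id, reg)
            if bank in busy:
                remaining.append((warp_id, reg))   # bank conflict: retry next cycle
            else:
                busy.add(bank)                     # arbiter grants the single port
        pending = remaining
        cycles += 1
    return cycles

no_conflict = collect_operands([(0, 1), (0, 2), (0, 3)])   # three different banks
same_warp = collect_operands([(0, 1), (0, 17)])            # both map to bank 1
two_warps = collect_operands([(0, 1), (1, 0)])             # both map to bank 1
```

Operand collectors exist to hide this serialization: an instruction waits in its collector until all of its reads have been granted, while other instructions use the idle banks.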

Scalar Pipeline
[Figure: pipeline diagram with the scalar pipeline highlighted. A single core: Dispatch feeds the ALU, FPU, and LD/SD units, which write to a Result Queue.]
- Functional units are pipelined
- Designs with multiple issue

Shared Memory Access
[Figure: pipeline diagram with the data memory stage highlighted, showing a conflict-free access pattern and a 2-way conflict access pattern across the banks.]
- Multiple bank organization
- Data is interleaved across banks
- Bank conflicts extend access times
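The effect of interleaving can be checked with a small calculation of the conflict degree of a warp's access pattern. The sketch assumes 32 banks of 4-byte words and treats accesses to the same word as a broadcast rather than a conflict; these parameters and the function name are assumptions, not values from the slide.

```python
def conflict_degree(addresses, num_banks=32, word_bytes=4):
    """Largest number of distinct words that fall in the same bank.

    A degree of 1 is conflict free; a degree of k means the access is
    serialized into k passes.
    """
    words_per_bank = {}
    for addr in addresses:
        word = addr // word_bytes
        # Consecutive words are interleaved across consecutive banks
        words_per_bank.setdefault(word % num_banks, set()).add(word)
    return max((len(words) for words in words_per_bank.values()), default=0)

unit_stride = conflict_degree([4 * t for t in range(32)])     # one word per bank
stride_two = conflict_degree([8 * t for t in range(32)])      # 2-way conflict
stride_32 = conflict_degree([128 * t for t in range(32)])     # all in bank 0
```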

Memory Request Coalescing
[Figure: memory requests (Tid, RQ Size, Base Addr, Offset) enter the Pending Request Table; memory address coalescing produces the pending request count, address masks, and thread masks.]
- The Pending Request Table (PRT) is filled whenever a memory request is issued
- Generate a set of address masks, one for each memory transaction
- Issue transactions
From J. Leng et al., "GPUWattch: Enabling Energy Optimizations in GPGPUs," ISCA 2013
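Coalescing can be sketched as grouping the per-thread addresses of one warp instruction by memory segment and recording, for each resulting transaction, which threads it serves. The 128-byte segment size, the function name `coalesce`, and the representation of a mask as an integer with one bit per thread are assumptions for illustration.

```python
def coalesce(addresses, active_mask, segment_bytes=128):
    """Return {segment base address: thread mask} for one warp memory instruction."""
    transactions = {}
    for tid, addr in enumerate(addresses):
        if not (active_mask >> tid) & 1:
            continue                                   # inactive thread: no request
        base = addr - addr % segment_bytes             # segment containing the address
        transactions[base] = transactions.get(base, 0) | (1 << tid)
    return transactions

FULL = 0xFFFFFFFF
aligned = coalesce([0x1000 + 4 * t for t in range(32)], FULL)      # one transaction
misaligned = coalesce([0x1040 + 4 * t for t in range(32)], FULL)   # spans two segments
strided = coalesce([0x1000 + 128 * t for t in range(32)], FULL)    # one per thread
```

The number of entries in the result is the number of memory transactions issued; the strided case is an example of memory divergence within a warp.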

Memory Hierarchy
[Figure: a warp accesses the L1 cache, shared memory, and the read-only cache, backed by the L2 cache and DRAM.]
- Configurable cache/shared memory configuration for L1
- Read-only cache for compiler or developer (intrinsics) use
- Shared L2 across all SMXs
- ECC coverage across the hierarchy (with a performance impact)
From GK110: NVIDIA white paper

Summary
- Synchronous progress of a warp through the SM pipelines
- Warp progress in a thread block can diverge for many reasons:
  - Barriers
  - Control divergence
  - Memory divergence
- How is the execution optimized? That is the next topic.
