Instruction Scheduling for Instruction-Level Parallelism CSS 548 Daniel R. Lewis November 28, 2012
Agenda Where does instruction scheduling fit into the compilation process? What is instruction-level parallelism? What are data dependencies, and how do they limit instruction-level parallelism? How should the compiler order instructions to maximize instruction-level parallelism? What is the effect on register allocation? What else must be considered in instruction scheduling?
Big Picture Instruction scheduling is an optimization that is implemented in the back-end of the compiler Operates on machine code (not IR) Tied to the characteristics of the CPU Assumes generic optimization is complete Idea: reorder instructions to increase instruction-level parallelism.
Instruction-Level Parallelism Parallelism on a single core; not multicore Pipelined processors are executing several instructions at once, at different stages Superscalar and VLIW processors can issue multiple instructions per cycle
Pipelined Parallelism (Jouppi and Wall, 1989) Ubiquitous in modern processors Superpipelining: Longer pipelines with shorter stages (Pentium 4 had 20-stage pipeline)
Superscalar Parallelism (Jouppi and Wall, 1989) Works with CPUs that have multiple functional units (e.g., ALU, multiplier, bit shifter) Since the mid-1990s, all general-purpose processors have been superscalar (the original Pentium was the first superscalar x86)
VLIW Parallelism (Jouppi and Wall, 1989) Most commonly seen in embedded DSPs Non-embedded example: Intel Itanium
Data Dependencies inc ebx ;; ebx++ mov eax, ebx ;; eax := ebx Ordering of some instructions must be preserved Three flavors of data dependence: True dependence (read after write) Antidependence (write after read) Output dependence (write after write) Data dependencies substantially reduce available parallelism Dynamically-scheduled processors detect dependencies at run-time and stall instructions until their operands are ready (most processors) Statically-scheduled processors leave dependency detection to the compiler, which must insert no-ops (simple, low-power embedded)
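The three flavors above can be detected mechanically from each instruction's read and write sets. The following is a minimal sketch (the helper name and set-based encoding are my own, not from the slides) that classifies the dependence forcing the slide's `mov eax, ebx` to follow `inc ebx`:

```python
# Hypothetical sketch: classify the dependence between two instructions,
# each described by the sets of registers it reads and writes.

def classify(first_writes, first_reads, second_writes, second_reads):
    """Return the dependence types that force `second` to run after `first`."""
    deps = []
    if first_writes & second_reads:
        deps.append("true (RAW)")    # read after write
    if first_reads & second_writes:
        deps.append("anti (WAR)")    # write after read
    if first_writes & second_writes:
        deps.append("output (WAW)")  # write after write
    return deps

# inc ebx       -> reads {ebx}, writes {ebx}
# mov eax, ebx  -> reads {ebx}, writes {eax}
print(classify({"ebx"}, {"ebx"}, {"eax"}, {"ebx"}))  # ['true (RAW)']
```

A scheduler may reorder two instructions only when this function returns an empty list for both orderings.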
Instruction Scheduling Goal: re-order instructions, accounting for data dependencies and other factors, to minimize the number of stalls/no-ops (Engineering a Compiler, p. 644)
Dependence Graphs and List Scheduling Key data structure is the dependence graph

(* List scheduling algorithm *)
Cycle := 1
Ready := [leaves of Graph]
Active := []
while (Ready + Active).size > 0
    for each op in Active
        if op.startCycle + op.length <= Cycle   (* op has completed *)
            remove op from Active
            for each successor s of op in Graph
                if s is ready
                    add s to Ready
    if Ready.size > 0
        remove an op from Ready
        op.startCycle := Cycle
        add op to Active
    Cycle++

(Engineering a Compiler, p. 645–652)
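The loop above can be turned into a small runnable sketch. This version assumes a single-issue machine and a FIFO ready queue (a real scheduler would use a priority such as critical-path length); the graph, node names, and latencies are invented for illustration:

```python
# Minimal list scheduler over a dependence graph given as successor lists.
def list_schedule(succs, latency):
    preds_left = {n: 0 for n in succs}        # unfinished predecessors per op
    for n in succs:
        for s in succs[n]:
            preds_left[s] += 1
    cycle = 1
    ready = [n for n in succs if preds_left[n] == 0]  # leaves of the graph
    active = []                                # (op, start cycle) pairs
    start = {}
    while ready or active:
        for op, s in list(active):
            if s + latency[op] <= cycle:       # op has completed
                active.remove((op, s))
                for succ in succs[op]:
                    preds_left[succ] -= 1
                    if preds_left[succ] == 0:  # all operands now available
                        ready.append(succ)
        if ready:
            op = ready.pop(0)                  # naive priority: FIFO
            start[op] = cycle
            active.append((op, cycle))
        cycle += 1
    return start

# Tiny example: two 2-cycle loads feeding a 1-cycle add (a -> c, b -> c).
succs = {"a": ["c"], "b": ["c"], "c": []}
latency = {"a": 2, "b": 2, "c": 1}
print(list_schedule(succs, latency))  # {'a': 1, 'b': 2, 'c': 4}
```

Note how the second load issues in cycle 2, overlapping the first load's latency; the add waits until both results are ready.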
Register Allocation Trade-Offs More storage locations = fewer data dependencies = more parallelism Many register allocation schemes seek to minimize the number of registers used, undermining parallelism Processors developed hardware register renaming as a workaround However, excess register usage may require spill code, which negates the benefit of parallelism Register allocation can be done either before or after instruction scheduling
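To make the trade-off concrete, consider two independent computations scheduled two ways (the register assignments below are hypothetical). Reusing one register minimizes register pressure but introduces a false dependence that serializes the pair; spending a second register lets them overlap:

```python
# Does register reuse alone force `second` after `first`?
# (anti/WAR or output/WAW dependence -- no data actually flows between them)
def has_false_dependence(first_writes, first_reads, second_writes):
    return bool((first_reads & second_writes) or (first_writes & second_writes))

# Version 1: both computations use eax (minimal registers).
#   mov eax, [x]; add eax, 1; mov [a], eax
#   mov eax, [y]; add eax, 2; mov [b], eax
print(has_false_dependence({"eax"}, {"eax"}, {"eax"}))  # True: must run in order

# Version 2: the second computation uses ebx instead.
print(has_false_dependence({"eax"}, {"eax"}, {"ebx"}))  # False: can overlap
```

Hardware register renaming removes exactly this kind of false dependence at run time, which is why minimal-register allocation became tolerable on dynamically scheduled processors.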
Advanced Topics List-scheduling algorithm operates on basic blocks Global code scheduling, code motion Software pipelining: schedule entire loop at once Branch prediction Alias analysis: determine if a pointer causes a data dependency Scheduling variable-length operations LOAD can take hundreds or thousands of cycles upon a cache miss Speculative execution
Questions?