Instruction Scheduling for Instruction-Level Parallelism


1 Instruction Scheduling for Instruction-Level Parallelism
CSS 548 Daniel R. Lewis November 28, 2012

2 Agenda
Where does instruction scheduling fit into the compilation process?
What is instruction-level parallelism?
What are data dependencies, and how do they limit instruction-level parallelism?
How should the compiler order instructions to maximize instruction-level parallelism?
What is the effect on register allocation?
What else must be considered in instruction scheduling?

3 Big Picture
Instruction scheduling is an optimization implemented in the back end of the compiler
Operates on machine code (not IR)
Tied to the characteristics of the CPU
Assumes machine-independent (generic) optimization is complete
Idea: reorder instructions to increase instruction-level parallelism

4 Instruction-Level Parallelism
Parallelism on a single core, not across multiple cores
Pipelined processors execute several instructions at once, each at a different stage
Superscalar and VLIW processors can issue multiple instructions per cycle

5 Pipelined Parallelism
(Jouppi and Wall, 1989)
Ubiquitous in modern processors
Superpipelining: longer pipelines with shorter stages (the Pentium 4 had a 20-stage pipeline)

6 Superscalar Parallelism
(Jouppi and Wall, 1989)
Works with CPUs that have multiple functional units (e.g., ALU, multiplier, bit shifter)
Since the mid-1990s, all general-purpose processors have been superscalar (the original Pentium was the first superscalar x86)

7 VLIW Parallelism
(Jouppi and Wall, 1989)
Most commonly seen in embedded DSPs
Non-embedded example: Intel Itanium

8 Data Dependencies
inc ebx       ;; ebx++
mov eax, ebx  ;; eax := ebx
The ordering of some instructions must be preserved
Three flavors of data dependence:
True dependence (read after write)
Antidependence (write after read)
Output dependence (write after write)
Data dependencies substantially reduce the available parallelism
Dynamically scheduled processors detect dependencies at run time and stall instructions until their operands are ready (most processors)
Statically scheduled processors leave dependency detection to the compiler, which must insert no-ops (simple, low-power embedded processors)
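The three flavors can be checked mechanically from each instruction's read and write sets. The following is a minimal illustrative sketch (not from the slides; the function name and register tuples are assumed), applied to the `inc ebx` / `mov eax, ebx` pair above:

```python
# Illustrative sketch: classifying data dependence between two instructions,
# each represented as a (reads, writes) pair of register-name sets.

def classify_dependence(first, second):
    """Return the set of dependence kinds from `first` to `second`."""
    reads1, writes1 = first
    reads2, writes2 = second
    deps = set()
    if writes1 & reads2:
        deps.add("true (RAW)")     # read after write
    if reads1 & writes2:
        deps.add("anti (WAR)")     # write after read
    if writes1 & writes2:
        deps.add("output (WAW)")   # write after write
    return deps

# inc ebx       -> reads {ebx}, writes {ebx}
# mov eax, ebx  -> reads {ebx}, writes {eax}
inc_ebx = ({"ebx"}, {"ebx"})
mov_eax_ebx = ({"ebx"}, {"eax"})
print(classify_dependence(inc_ebx, mov_eax_ebx))  # a true dependence on ebx
```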

9 Instruction Scheduling
Goal: reorder instructions, accounting for data dependencies and other factors, to minimize the number of stalls/no-ops
(Engineering a Compiler, p. 644)
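To see why reordering pays off, a toy model of a single-issue, in-order pipeline can count the stall cycles a given instruction order incurs. This is a hypothetical sketch with assumed names and made-up latencies, not the book's model:

```python
# Toy model: one instruction issues per cycle; an instruction stalls
# until every register it reads has been produced.

def count_stalls(ops):
    """ops: list of (name, latency, regs_read, reg_written) tuples."""
    ready_at = {}            # register -> cycle its value becomes available
    cycle, stalls = 0, 0
    for name, latency, reads, writes in ops:
        earliest = max([ready_at.get(r, 0) for r in reads], default=0)
        if earliest > cycle:
            stalls += earliest - cycle   # pipeline waits for operands
            cycle = earliest
        if writes:
            ready_at[writes] = cycle + latency
        cycle += 1                        # one issue slot per cycle
    return stalls

# A 3-cycle load immediately followed by its consumer stalls the pipe;
# sliding an independent op in between hides part of the latency.
naive = [("load", 3, [], "r1"), ("add", 1, ["r1"], "r2"), ("mul", 1, [], "r3")]
sched = [("load", 3, [], "r1"), ("mul", 1, [], "r3"), ("add", 1, ["r1"], "r2")]
print(count_stalls(naive), count_stalls(sched))  # 2 stalls vs. 1 stall
```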

10 Dependence Graphs and List Scheduling
Key data structure is the dependence graph

(* List scheduling algorithm *)
Cycle := 1
Ready := [leaves of Graph]
Active := []
while (Ready + Active).size > 0
    for each op in Active
        if op.startCycle + op.length < Cycle
            remove op from Active
            for each successor s of op in Graph
                if s is ready
                    add s to Ready
    if Ready.size > 0
        remove an op from Ready
        op.startCycle := Cycle
        add op to Active
    Cycle++

(Engineering a Compiler, p. 645–652)
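A hypothetical Python rendering of that loop may help make the readiness test concrete: a successor becomes ready once all of its predecessors have completed. Names here (`list_schedule`, the toy graph) are assumptions, and `<=` is used for the completion test so a one-cycle op retires after one cycle:

```python
# Sketch of list scheduling over a dependence graph given as a
# successor map {op: [successors]} and a latency map {op: cycles}.

def list_schedule(succs, length):
    # derive predecessor lists from the successor map
    preds = {op: [] for op in succs}
    for op, ss in succs.items():
        for s in ss:
            preds[s].append(op)
    cycle, start, done = 1, {}, set()
    ready = [op for op in succs if not preds[op]]   # leaves of the graph
    active = []
    while ready or active:
        for op in list(active):
            if start[op] + length[op] <= cycle:     # op completed
                active.remove(op)
                done.add(op)
                for s in succs[op]:
                    # s is ready once all its predecessors have finished
                    if all(p in done for p in preds[s]) and s not in ready:
                        ready.append(s)
        if ready:
            op = ready.pop(0)                       # pick any ready op
            start[op] = cycle
            active.append(op)
        cycle += 1
    return start                                    # op -> issue cycle

# a (2 cycles) feeds b; c is independent, so it fills a's latency slot
print(list_schedule({"a": ["b"], "b": [], "c": []},
                    {"a": 2, "b": 1, "c": 1}))
```

A real scheduler would pop the *highest-priority* ready op (e.g., longest path to a root) rather than an arbitrary one; the loop structure is unchanged.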

11 Register Allocation Trade-Offs
More storage locations = fewer data dependencies = more parallelism
Many register allocation schemes seek to minimize the number of registers used, undermining parallelism
Processors developed hardware register renaming as a workaround
However, using more registers than are available requires spill code, which negates the benefit of parallelism
Register allocation can be done either before or after instruction scheduling
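The "more storage = fewer dependencies" trade-off can be illustrated by counting the false (anti and output) dependence edges in a block before and after renaming. A hedged sketch with assumed names, representing each instruction as (registers-read, register-written) sets:

```python
# Illustrative sketch: reusing one register for two independent
# computations creates false dependence edges that renaming removes.

def false_deps(instrs):
    """instrs: list of (reads, writes) register-name sets, in program order.
    Count anti (WAR) and output (WAW) edges over all ordered pairs."""
    count = 0
    for i in range(len(instrs)):
        for j in range(i + 1, len(instrs)):
            r1, w1 = instrs[i]
            _, w2 = instrs[j]
            count += len(r1 & w2) + len(w1 & w2)   # WAR + WAW edges
    return count

# Two independent computations squeezed into one register r1...
tight = [(set(), {"r1"}), ({"r1"}, {"r2"}),    # r1 = ...; r2 = f(r1)
         (set(), {"r1"}), ({"r1"}, {"r3"})]    # r1 = ...; r3 = g(r1)
# ...versus the same code with the second computation renamed to r4.
renamed = [(set(), {"r1"}), ({"r1"}, {"r2"}),
           (set(), {"r4"}), ({"r4"}, {"r3"})]
print(false_deps(tight), false_deps(renamed))  # 2 false edges vs. 0
```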

12 Advanced Topics
The list-scheduling algorithm operates on basic blocks
Global code scheduling, code motion
Software pipelining: schedule an entire loop at once
Branch prediction
Alias analysis: determine whether a pointer causes a data dependency
Scheduling variable-length operations: a LOAD can take hundreds or thousands of cycles on a cache miss
Speculative execution

13 Questions?

