Hardware Support for Compiler Speculation

Presentation on theme: "Hardware Support for Compiler Speculation"— Presentation transcript:

1

2 Hardware Support for Compiler Speculation
The compiler needs to move instructions before a branch, possibly before the branch condition is evaluated
Requirements:
  Instructions that can be moved without disrupting data flow
  Exceptions that can be ignored until the outcome is known
  Ability to speculatively access memory with potential address conflicts

3 Exception Support
Four methods:
  Hardware and OS cooperate to ignore exceptions for speculative instructions
  Speculative instructions never raise exceptions; explicit checks must be made
  Poison bits mark registers holding invalid results; using such a register causes an exception
  Speculative results are buffered until the outcome is certain
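
The poison-bit method can be sketched in software. This is only an illustrative model, not how any particular processor implements it: the `SpecReg` type, the function names, and the use of a NULL pointer to stand in for a faulting address are all assumptions made for the example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a register with an extra poison bit. */
typedef struct {
    long value;
    bool poison;   /* set when a speculative instruction would have faulted */
} SpecReg;

/* Speculative load: a NULL pointer stands in for any faulting address.
   Instead of raising an exception, the result is marked as poisoned. */
SpecReg spec_load(const long *addr) {
    SpecReg r = {0, false};
    if (addr == NULL) {
        r.poison = true;
    } else {
        r.value = *addr;
    }
    return r;
}

/* Speculative add: poison propagates from either source to the result. */
SpecReg spec_add(SpecReg a, SpecReg b) {
    SpecReg r = {0, a.poison || b.poison};
    if (!r.poison) {
        r.value = a.value + b.value;
    }
    return r;
}

/* Non-speculative use: returns false (standing in for "raise the
   exception") if the register is poisoned; otherwise writes *out. */
bool commit_use(SpecReg r, long *out) {
    if (r.poison) {
        return false;
    }
    *out = r.value;
    return true;
}
```

The point of the sketch is that the fault is recorded at the speculative load but only acted on at the first non-speculative use, so a speculated instruction on the wrong path never disturbs the program.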

4 Exception Handling
Nonterminating exceptions (e.g. page faults) can be handled normally
This may cause a serious performance loss

5 Memory Reference Speculation
Moving loads across stores is only safe if the addresses do not conflict
Special instructions check for address conflicts
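
A minimal sketch of the address check, assuming a single tracked load. The `LoadCheck` structure and the three function names are hypothetical; real hardware would track several loads in a small table and perform the comparison in the store path.

```c
#include <stdbool.h>
#include <stddef.h>

/* One tracked speculative load: the address it read and whether the
   speculated value is still known to be good. */
typedef struct {
    const int *addr;
    bool valid;
} LoadCheck;

/* Load hoisted above earlier stores: read now, remember the address. */
int speculative_load(LoadCheck *chk, const int *addr) {
    chk->addr = addr;
    chk->valid = true;
    return *addr;
}

/* Every store compares its address with the tracked one;
   a match means the hoisted load read a stale value. */
void checked_store(LoadCheck *chk, int *addr, int value) {
    if (chk->valid && chk->addr == addr) {
        chk->valid = false;
    }
    *addr = value;
}

/* Check placed at the load's original position: keep the speculated
   value if no store conflicted, otherwise redo the load. */
int check_load(const LoadCheck *chk, int speculated) {
    if (chk->valid) {
        return speculated;
    }
    return *chk->addr;
}
```

When the addresses rarely conflict, the check almost always succeeds and the load latency is hidden; the reload is the recovery path for the uncommon case.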

6 4.6. Crosscutting Issues: Hardware vs. Software Speculation
A number of trade-offs and limitations:
  Disambiguating memory references is hard for a compiler
  Hardware branch prediction is usually better
  Precise exceptions are easier in hardware
  Hardware does not require “housekeeping” code
  Compilers can “look” further ahead
  Hardware techniques are more portable

7 Hardware/Software Speculation
Major disadvantage of hardware: complexity!
Some architectures combine hardware and software approaches

8 4.7. Putting It All Together: IA-64 and Itanium
RISC-style, register-register architecture
Emphasis on software-based optimisations
Features:
  128 × 65-bit integer registers
  128 × 82-bit FP registers
  64 predicate registers; 8 branch registers

9 Registers
Integer registers use a windowing mechanism
  Registers 0–31 are always visible
  Remainder arranged in overlapping windows
  Local and out areas (variable size)
  Hardware for over-/underflow
Int and FP registers support register rotation
  Supports software pipelining

10 Instruction Format and VLIW
Compiler schedules parallel instructions; flags dependences
Instruction group
  Sequence of (register) independent instructions
  Compiler marks boundaries between groups (stop)
Bundle
  128 bits: 5-bit template + 3 × 41-bit instructions
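
The bundle arithmetic (5 + 3 × 41 = 128) can be checked by unpacking a bundle held in two 64-bit words. The slide does not give the bit layout; the sketch assumes the IA-64 convention of the template in the low-order bits (bits 0 to 4) followed by slots 0, 1 and 2. The `Bundle` type and `unpack_bundle` name are made up for the example.

```c
#include <stdint.h>

/* 41-bit mask for one instruction slot. */
#define SLOT_MASK ((UINT64_C(1) << 41) - 1)

typedef struct {
    unsigned template_field;   /* 5 bits */
    uint64_t slot[3];          /* 41 bits each */
} Bundle;

/* lo holds bundle bits 0..63, hi holds bundle bits 64..127.
   Assumed layout: template = bits 0..4, slot 0 = bits 5..45,
   slot 1 = bits 46..86, slot 2 = bits 87..127. */
Bundle unpack_bundle(uint64_t lo, uint64_t hi) {
    Bundle b;
    b.template_field = (unsigned)(lo & 0x1F);
    b.slot[0] = (lo >> 5) & SLOT_MASK;
    /* Slot 1 straddles the two words: 18 bits from lo, 23 from hi. */
    b.slot[1] = ((lo >> 46) | (hi << 18)) & SLOT_MASK;
    b.slot[2] = (hi >> 23) & SLOT_MASK;
    return b;
}
```

Note that the middle slot crosses the 64-bit boundary, which is why it is assembled from both words.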

11 Instruction Bundle
Template specifies stops and execution unit
  I-unit (int + special: multimedia, etc.)
  M-unit (int + memory access)
  F-unit (FP)
  B-unit (branches)
  L+X (extended instructions)

12 Example
for (int k = 0; k < 1000; k++)
{ x[k] = x[k] + s; }
Unrolled seven times
Optimised for size:
  9 bundles; 15% nops
  21 cycles (3 per calculation)
Optimised for performance:
  11 bundles; 30% nops
  12 cycles (1.7 per calculation)
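
For reference, here is a sketch of what unrolling this loop seven times looks like at source level. The function names and the choice of `double` for the element type are assumptions; the slide does not give the type of `x` or `s`. Because 1000 = 7 × 142 + 6, a cleanup loop handles the last six elements.

```c
#include <stddef.h>

/* The loop from the slide, wrapped in a function for comparison. */
void add_scalar(double *x, size_t n, double s) {
    for (size_t k = 0; k < n; k++) {
        x[k] = x[k] + s;
    }
}

/* The same loop unrolled seven times by hand. The seven statements in
   the body are independent of each other, which is what gives the
   compiler room to fill the instruction slots of several bundles. */
void add_scalar_unrolled7(double *x, size_t n, double s) {
    size_t k = 0;
    for (; k + 7 <= n; k += 7) {
        x[k]     = x[k]     + s;
        x[k + 1] = x[k + 1] + s;
        x[k + 2] = x[k + 2] + s;
        x[k + 3] = x[k + 3] + s;
        x[k + 4] = x[k + 4] + s;
        x[k + 5] = x[k + 5] + s;
        x[k + 6] = x[k + 6] + s;
    }
    /* Cleanup for the elements left over when n is not a multiple of 7. */
    for (; k < n; k++) {
        x[k] = x[k] + s;
    }
}
```

The per-calculation figures on the slide follow from dividing by the unroll factor: 21 / 7 = 3 cycles and 12 / 7 ≈ 1.7 cycles.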

13 Instructions
41 bits long
  4-bit opcode (+ template bits)
  6-bit predicate register specifier
Predication
  Almost all instructions can be predicated
  Branch is jump with predicate check!
  Complex comparisons set two predicate registers
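
Predication can be illustrated with a small model in which a compare writes two complementary predicates and each following operation is guarded by one of them. This is a sketch only: `PredPair`, `cmp_lt`, `pred_mov` and `max_predicated` are hypothetical names, and the `if` inside `pred_mov` models the hardware's predicate check rather than a branch in the program.

```c
#include <stdbool.h>

/* A compare sets two predicate registers with complementary values. */
typedef struct {
    bool p_true;
    bool p_false;
} PredPair;

PredPair cmp_lt(int a, int b) {
    PredPair p = { a < b, !(a < b) };
    return p;
}

/* Predicated move: the instruction always issues, but it only writes
   its destination when its guarding predicate is true. */
void pred_mov(bool pred, int *dst, int src) {
    if (pred) {   /* models the hardware predicate check */
        *dst = src;
    }
}

/* if (a < b) r = b; else r = a;
   After if-conversion both moves issue and exactly one takes effect,
   so the instruction stream contains no conditional branch. */
int max_predicated(int a, int b) {
    int r = 0;
    PredPair p = cmp_lt(a, b);
    pred_mov(p.p_true, &r, b);
    pred_mov(p.p_false, &r, a);
    return r;
}
```

Setting both predicates in one compare is what lets the then-path and the else-path be scheduled together in the same instruction group.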

14 Speculation
Exceptions can be deferred
  Uses poison bits (65-bit registers)
  Nonspeculative and chk instructions raise exception
Speculative loads
  Called advanced load (ld.a)
  Stores check addresses

15 Itanium
First implementation of IA-64
Issues up to six instructions per cycle (two bundles)
Nine functional units: 2 × I, 2 × M, 3 × B, 2 × F
10-stage pipeline
Multilevel dynamic branch predictor

16 Itanium
Complex hardware with many features of dynamically scheduled pipelines!
  Branch prediction
  Register renaming
  Scoreboarding
  Deep pipeline
  etc.

17 Itanium: Performance
SPECint not too impressive
  85% of Alpha (older, more power-efficient processor!)
FP better
  Faster, even with slower clock!
  But skewed by one benchmark for Pentium
  Alpha compilers need improvement

18 4.8. Another View: ILP in Embedded Processors
Trimedia (see chapter 2)
  “Classic” VLIW
  Hardware decompression of code
Crusoe
  Software translation of 80x86 to VLIW
  Low power

19 Trimedia TM32 Architecture
VLIW
  Instruction specifies five operations
Static scheduling
  No hardware hazard detection
23 functional units (11 types)

20 Transmeta Crusoe
Low power design
Emulates 80x86
VLIW
  64-bit (2 op) and 128-bit (4 op) instructions
  Five types of operations:
    ALU (int, register-register)
    Compute (int ALU, FP, multimedia)
    Memory
    Branch
    Immediate

21 Crusoe
Simple, in-order pipeline
  Integer: 6-stage (IF1, IF2, DEC, OP, EX, WB)
  FP: 10-stage (5 EX stages)

22 Crusoe
Software interpretation of 80x86 code
  Basic blocks cached
Exception handling complicated
  Crusoe has good support for speculative reordering
  Memory writes buffered and committed only when safe

23 Crusoe Performance
Hard to measure accurately
Power consumption is low (⅓ of Pentium)

24 4.9. Fallacies and Pitfalls
Fallacy: There is a simple approach to multiple-issue (high performance with low complexity)
  Big gap between peak and sustained performance for multiple-issue processors
  Need dynamic scheduling, speculation support, branch prediction, sophisticated prefetch, etc.
  Sophisticated compilers are required

25 4.10. Concluding Comments
“Hardware” techniques are migrating to “software” and vice versa
Multiprocessors may be important in future

26 Chapter 5 Memory Hierarchy Design

27 Memory Hierarchies
Not a new idea!
Takes advantage of the principle of locality
  Temporal
  Spatial
Small, fast memories close to processor

28 Memory Hierarchies
[Diagram: hierarchy levels from top to bottom are Registers, Cache, Memory, I/O Devices (virtual memory); arrows labelled Speed, Cost and Size run alongside the levels]

29 Introduction
Usually includes responsibility for memory protection
Performance is a major problem

30 Figure 5.2

31 Characterising Levels of the Memory Hierarchy
Four questions:
  Where can a block be placed? (placement)
  How is a block found? (identification)
  Which block should be replaced on a miss? (replacement)
  What happens on a write? (write strategy)

32 Example
The Alpha is used as an example throughout

33 Caches
Where is a block placed in a cache?
Three possible answers, giving three different types:
  Anywhere: fully associative
  Only into one block: direct mapped
  Into subset of blocks: set associative

34 Cache Categories
Set associative
  n-way set associative, where n is the number of blocks in a set
  Commonly, n = 2 or n = 4
Direct-mapped
  “1-way set associative”
Fully associative
  “m-way set associative” (m is the total number of blocks in the cache)
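
Because all three categories are special cases of n-way set associativity, one formula covers placement: set index = (block address) mod (number of sets), where the number of sets is the total number of blocks divided by n. The sketch below assumes `ways` is non-zero and divides `total_blocks` evenly; the function name is made up for the example.

```c
#include <stddef.h>

/* Set that a memory block maps to in a cache of total_blocks blocks
   organised as ways-way set associative.
   ways == 1            -> direct mapped (one block per set)
   ways == total_blocks -> fully associative (a single set)
   Assumes ways > 0 and total_blocks % ways == 0. */
size_t cache_set_index(size_t block_addr, size_t total_blocks, size_t ways) {
    size_t num_sets = total_blocks / ways;
    return block_addr % num_sets;
}
```

For an 8-block cache and memory block 12: direct mapped gives set 12 mod 8 = 4, 2-way set associative gives set 12 mod 4 = 0 (either block in that set), and fully associative gives the single set 0 (any of the 8 blocks).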

35

