1 Continuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads
Milad Hashemi (UT Austin/Google), Onur Mutlu (ETH Zürich), Yale N. Patt (UT Austin)
October 19th, 2016

2 Continuous Runahead Outline
Overview of Runahead
Runahead Limitations
Continuous Runahead Dependence Chains
Continuous Runahead Engine
Continuous Runahead Evaluation
Conclusions

3 Continuous Runahead Outline
Overview of Runahead
Runahead Limitations
Continuous Runahead Dependence Chains
Continuous Runahead Engine
Continuous Runahead Evaluation
Conclusions

4 Runahead Execution Overview
Runahead dynamically expands the instruction window when the pipeline is stalled [Mutlu et al., 2003]
The core checkpoints architectural state
The result of the memory operation that caused the stall is marked as poisoned in the physical register file
The core continues to fetch and execute instructions
Operations are discarded instead of retired
The goal is to generate new independent cache misses
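
As a software analogy, here is a minimal sketch of the runahead mechanism described above. All names are hypothetical; the actual mechanism is implemented in hardware, not code.

  # Illustrative model of runahead mode, not the authors' implementation.
  def enter_runahead(core, stalled_load):
      core.checkpoint = core.arch_state.copy()       # checkpoint architectural state
      core.poison[stalled_load.dest_reg] = True      # mark the miss result as poisoned
      while not stalled_load.data_ready():
          inst = core.fetch_next()                   # keep fetching past the stall
          if any(core.poison.get(r, False) for r in inst.src_regs):
              core.poison[inst.dest_reg] = True      # poison propagates to dependents
          else:
              core.execute(inst)                     # may generate new independent misses
          # runahead results are discarded, never retired
      core.arch_state = core.checkpoint              # restore state, resume normal execution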

5 Traditional Runahead Accuracy

6 Traditional Runahead Accuracy
Runahead is 95% Accurate

7 Traditional Runahead Prefetch Coverage

8 Traditional Runahead Prefetch Coverage
Runahead has only 13% Prefetch Coverage

9 Traditional Runahead Performance Gain

10 Traditional Runahead Performance Gain
Runahead has a 12% Performance Gain
The Runahead Oracle has an 85% Performance Gain

11 Traditional Runahead Interval Length

12 Traditional Runahead Interval Length
Runahead Intervals are Short
Low Performance Gain

13 Continuous Runahead Challenges
Which instructions to use during Continuous Runahead?
Dynamically target the dependence chains that lead to critical cache misses
What hardware to use for Continuous Runahead?
How long should chains pre-execute for?

14 Dependence Chains
LD [R3] -> R5
ADD R4, R5 -> R9
LD [R6] -> R8 (Cache Miss)
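
In software terms, a dependence chain can be recovered by walking backwards from the cache-missing load over a window of instructions, collecting the producer of each needed source register. A minimal sketch with hypothetical names (the CRE does this in hardware):

  # Backward slice from a cache-missing load to its producers.
  def extract_chain(window, miss_idx):
      # window: instructions in program order; each has .src_regs and .dest_reg
      chain = [window[miss_idx]]
      needed = set(window[miss_idx].src_regs)     # regs the miss address depends on
      for inst in reversed(window[:miss_idx]):
          if inst.dest_reg in needed:             # producer of a needed value
              chain.append(inst)
              needed.discard(inst.dest_reg)
              needed.update(inst.src_regs)        # now need its sources too
      return chain[::-1]                          # return the chain in program order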

15 Dependence Chain Selection Policies
We experiment with 3 policies to determine the best policy to use for Continuous Runahead:
PC-Based Policy: use the dependence chain that has caused the most misses for the PC that is blocking retirement
Maximum-Misses Policy: use a dependence chain from the PC that has generated the most misses for the application
Stall Policy: use a dependence chain from the PC that has caused the most full-window stalls for the application
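
All three policies reduce to picking the PC whose counter is largest; only the statistic differs. A sketch, assuming hypothetical per-PC miss and stall counters:

  def select_pc(stats, policy, blocking_pc=None):
      # stats: {pc: {"misses": count, "stalls": count}}
      if policy == "pc_based":
          return blocking_pc                      # PC currently blocking retirement
      if policy == "max_misses":
          return max(stats, key=lambda pc: stats[pc]["misses"])
      if policy == "stall":
          return max(stats, key=lambda pc: stats[pc]["stalls"])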

16 Dependence Chain Selection Policies

17 Why does Stall Policy Work?

18 Why does Stall Policy Work?
19 PCs cover 90% of all Stalls

19 Constrained Dependence Chain Storage

20 Constrained Dependence Chain Storage
Storing 1 Chain provides 95% of the Performance

21 Continuous Runahead Chain Generation
Maintain two structures:
  A 32-entry cache of PCs to track the operations that cause the pipeline to frequently stall
  The last dependence chain for the PC that has caused the most full-window stalls
At every full-window stall:
  Increment the counter of the PC that caused the stall
  Generate a dependence chain for the PC that has caused the most stalls
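
A sketch of how these two structures might interact at each full-window stall. The dict stands in for the 32-entry PC cache, and extract_chain is the sketch from the Dependence Chains slide; all names are illustrative:

  STALL_CACHE_ENTRIES = 32
  stall_counts = {}     # models the 32-entry cache of stalling PCs
  hot_chain = None      # last chain generated for the top-stalling PC

  def on_full_window_stall(pc, window, miss_idx):
      global hot_chain
      if pc not in stall_counts and len(stall_counts) >= STALL_CACHE_ENTRIES:
          coldest = min(stall_counts, key=stall_counts.get)
          del stall_counts[coldest]                 # evict the least-stalling PC
      stall_counts[pc] = stall_counts.get(pc, 0) + 1
      if pc == max(stall_counts, key=stall_counts.get):
          hot_chain = extract_chain(window, miss_idx)   # regenerate chain for hottest PC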

22 Runahead for Longer Intervals

23 CRE Microarchitecture
No Front-End
No Register Renaming Hardware
32 Physical Registers
2-Wide
No Floating Point or Vector Pipeline
4kB Data Cache

24 Dependence Chain Generation
[Figure: over cycles 1-5, the chain is walked op by op and each core physical register (P*) is remapped through a register remapping table (indexed by the architectural registers EAX, EBX, ECX) to a CRE physical register (E*)]
Core chain:           Remapped CRE chain:
ADD P7 -> P1          ADD E5 -> E3
SHIFT P1 -> P9        SHIFT E3 -> E4
ADD P9 + P1 -> P3     ADD E4 + E3 -> E2
SHIFT P3 -> P2        SHIFT E2 -> E1
LD [P2] -> P8         LD [E1] -> E0
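
The remapping step can be modeled as a table that allocates a fresh CRE physical register the first time each core physical register is seen. A hedged sketch (names and allocation order hypothetical, chosen only to illustrate the idea):

  class RemapTable:
      def __init__(self, num_cre_regs=32):           # the CRE has 32 physical registers
          self.table = {}                            # core physical reg -> CRE physical reg
          self.free = [f"E{i}" for i in range(num_cre_regs)]

      def remap(self, core_reg):
          if core_reg not in self.table:
              self.table[core_reg] = self.free.pop(0)    # allocate next free CRE register
          return self.table[core_reg]

  def translate(chain, table=None):
      # Rewrite each (op, srcs, dest) tuple of a chain onto CRE registers.
      table = table or RemapTable()
      return [(op, [table.remap(s) for s in srcs], table.remap(d))
              for (op, srcs, d) in chain]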

25 Dependence Chain Generation

26 Interval Length

27 System Configuration
Single-Core/Quad-Core: 4-wide issue, 256-entry reorder buffer, 92-entry reservation station
Caches: 32 KB 8-way set-associative L1 I/D-cache; 1 MB 8-way set-associative shared last-level cache per core
Memory: non-uniform memory access latency, DDR3 system, 256-entry memory queue, batch scheduling
Prefetchers: stream, global history buffer (GHB); feedback-directed prefetching with dynamic degree 1-32
CRE Compute: 2-wide issue; 1 Continuous Runahead issue context with a 32-entry buffer and 32-entry physical register file; 4 kB data cache
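
For reference, the same evaluated configuration expressed as a simple parameter dictionary (a transcription convenience; this is not the authors' simulator interface):

  SYSTEM_CONFIG = {
      "cores": (1, 4),                      # single-core and quad-core setups
      "issue_width": 4,
      "rob_entries": 256,
      "reservation_station_entries": 92,
      "l1_cache": "32 KB, 8-way, I/D",
      "llc_per_core": "1 MB, 8-way, shared",
      "memory": "DDR3, NUMA latency, 256-entry queue, batch scheduling",
      "prefetchers": ["stream", "GHB", "feedback-directed, degree 1-32"],
      "cre": {"issue_width": 2, "contexts": 1, "buffer_entries": 32,
              "physical_registers": 32, "data_cache": "4 kB"},
  }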

28 Single-Core Performance

29 Single-Core Performance
21% Single-Core Performance Increase over the Prior State of the Art

30 Single-Core Performance + Prefetching

31 Single-Core Performance + Prefetching
Increases Performance both over and in Conjunction with Prefetching

32 Independent Miss Coverage

33 Independent Miss Coverage
70% Prefetch Coverage

34 Bandwidth Overhead

35 Bandwidth Overhead
Low Bandwidth Overhead

36 Multi-Core Performance

37 Multi-Core Performance
43% Weighted Speedup Increase

38 Multi-Core Performance + Prefetching

39 Multi-Core Performance + Prefetching
13% Weighted Speedup Gain over GHB Prefetching

40 Multi-Core Energy Evaluation

41 Multi-Core Energy Evaluation
22% Energy Reduction

42 Conclusions
Runahead prefetch coverage is limited by the duration of each runahead interval
To remove this constraint, we introduce the notion of Continuous Runahead
We can dynamically identify the most critical LLC misses to target with Continuous Runahead by tracking the operations that cause the pipeline to frequently stall
We migrate these dependence chains to the CRE, where they are executed continuously in a loop

43 Conclusions
Continuous Runahead greatly increases prefetch coverage
Increases single-core performance by 34.4%
Increases multi-core performance by 43.3%
Synergistic with various types of prefetching

44 Continuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads
Milad Hashemi (UT Austin/Google), Onur Mutlu (ETH Zürich), Yale N. Patt (UT Austin)
October 19th, 2016

