1 Using Virtual Load/Store Queues (VLSQs) to Reduce the Negative Effects of Reordered Memory Instructions
Aamer Jaleel and Bruce Jacob
Electrical and Computer Engineering, University of Maryland, College Park
{ajaleel, eng.umd.edu

2 Paper Motivation
A. Jaleel and B. Jacob. “Using Virtual Load/Store Queues to Reduce the Negative Effects of Reordered Memory Instructions”
Maximizing Application ILP:
–OoO performance depends on the size of the instruction window or reorder buffer (ROB)
–Improve ILP with larger ROB sizes
Before This Paper:
–Many studies have shown large performance gains with large ROBs
–Most have discounted real effects in the memory subsystem

3 Paper Contributions
Uncovering A Problem:
–Increasing OoO capability degrades memory system performance
  –Increase in replay traps
  –Increase in L1 cache misses
The Reason:
–OoO scheduler reordering memory instructions
The Solution:
–Restrict reordering of memory instructions
–Virtual Load/Store Queue (VLSQ)

4 Background – Replay Traps
Hardware events to ensure correct execution order of memory instructions
Types of Replay Traps
–Load-Store Replay Trap
–Wrong-Size Replay Trap
–Load-Load Replay Trap
–Load-Miss Load Replay Trap
[Figure: one example per trap type, entries listed by leading number with the parenthesized number kept as shown; figure labels P2, P1 appear with the last two examples]
Load-Store Replay: 1. LD BYTE A (1); 2. ST BYTE A (3); 3. LD BYTE A (2); 4. LD BYTE B (4)
Wrong-Size Replay: 1. LD BYTE A (1); 2. ST BYTE A (2); 3. LD HALF A (3); 4. LD BYTE B (4)
Load-Miss Load Replay: 1. + LD BYTE A (1); 2. ST BYTE A (2); 3. LD BYTE A (3); 4. LD BYTE B (4)
Load-Load Replay: 1. LD BYTE A (4); 2. ST BYTE A (2); 3. LD BYTE A (1); 4. LD BYTE B (3)
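
The load-store and load-load cases on this slide can be sketched in code. This is a minimal Python sketch, not the simulator's logic. It assumes the number in parentheses after each instruction is its position in issue order, and the names `Access` and `find_order_violations` are hypothetical. Wrong-size and load-miss load traps are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    prog: int    # position in program order
    kind: str    # "LD" or "ST"
    addr: str    # symbolic address, e.g. "A"
    issue: int   # position in issue order (assumed meaning of the slide's parentheses)


def find_order_violations(ops):
    """Return (trap_type, older_prog, younger_prog) for same-address pairs
    where the younger access issued before the older one."""
    traps = []
    by_prog = sorted(ops, key=lambda o: o.prog)
    for i, older in enumerate(by_prog):
        for younger in by_prog[i + 1:]:
            # Different address, or issued in program order: nothing to flag.
            if older.addr != younger.addr or younger.issue > older.issue:
                continue
            if older.kind == "ST" and younger.kind == "LD":
                traps.append(("load-store", older.prog, younger.prog))
            elif older.kind == "LD" and younger.kind == "LD":
                traps.append(("load-load", older.prog, younger.prog))
    return traps


# Load-Store Replay example from the slide.
example = [
    Access(1, "LD", "A", 1),
    Access(2, "ST", "A", 3),
    Access(3, "LD", "A", 2),
    Access(4, "LD", "B", 4),
]
print(find_order_violations(example))  # [('load-store', 2, 3)]
```

In this simplified model the load at program position 3 is flagged because it issued ahead of the older store to the same address.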

5 Experimental Framework
Simulator:
–Sim-Alpha
–64K 2-Way IL1/DL1, 2MB 4-Way L2, 8 MSHRs / cache
–Branch predictor: 4K BTB, and 2K hybrid g-share/bimodal
–1024-entry store-wait predictor
–Hardware data prefetcher: 2-Way 256-entry stride table and eight 8-entry stream buffers
–Detailed DDR2 DRAM model with queuing delays
Benchmarks
–SPEC2000

6 The Problem w/↑ OoO Capability
Replay Traps:
–Trap frequency increases by a factor of 5
–Trap overhead increases by 10-60%
L1 Cache Misses:
–Number of cache misses increases by 15% (average)
–fma3d, mesa, wupwise, eon, vpr, twolf, swim (20% – 40%)
[Figure: Traps / 1000 Instructions, and % Increase in L1 Cache Misses (compared to ROB 80), for ROB-80, ROB-128, ROB-256, ROB-512]

7 Why The Problem?
OoO execution reorders both ALU and memory instructions
Replay traps and cache misses are problems associated with memory instructions
Hypothesis:
–Reordering of ALU instructions poses little or no threat
BUT
–Reordering of memory instructions causes the problem

8 How many are issued in-order?
10 to 20% of memory instructions are issued in order with increased OoO capability
Need to reduce reordering of memory instructions
[Figure: % Memory Instructions vs. Distance From Being Issued In Program Order (-W to W, with regions Issued Late, In-order Issue, Issued Early) for ROB 80, ROB 128, ROB 256, ROB 512; in-order labels on the plot: 55%, 10%, 15%, 21%]
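
A metric like the one plotted on this slide could be computed from an issue trace as follows. This is a hedged Python sketch under stated assumptions: distance is taken to be an instruction's position in issue order minus its position in program order (0 means in order, positive means late, negative means early). That definition and the function names are illustrative, not taken from the paper.

```python
from collections import Counter


def issue_distances(issue_order):
    """issue_order[k] is the program-order index of the k-th memory
    instruction issued. Distance = issue position - program position."""
    return [k - p for k, p in enumerate(issue_order)]


def distance_histogram(issue_order):
    """Map each distance to the percentage of memory instructions at that distance."""
    distances = issue_distances(issue_order)
    if not distances:
        return {}
    counts = Counter(distances)
    return {d: 100.0 * n / len(distances) for d, n in sorted(counts.items())}


# Four memory instructions; the middle two issue in swapped order.
print(distance_histogram([0, 2, 1, 3]))  # {-1: 25.0, 0: 50.0, 1: 25.0}
```

The entry at distance 0 is the in-order fraction, the quantity the slide reports as falling to 10 to 20% with increased OoO capability.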

9 Virtual Load/Store Queue (VLSQ)
Traditional LSQ: Any ready instruction is issued
[Figure: Traditional Load/Store Queue holding MEM 0 ... MEM N between LSQ HEAD and LSQ TAIL; legend: ISSUED, READY, NOT READY]

10 Virtual Load/Store Queue (VLSQ)
Traditional LSQ: Any ready instruction is issued
Virtual LSQ: Only issue instructions residing in a virtual window
[Figure: Traditional Load/Store Queue (Virtual Window Size = Inf) beside Virtual Load/Store Queue (Virtual Window Size = 4), each holding MEM 0 ... MEM N between LSQ HEAD and LSQ TAIL; the virtual queue also marks VIRTUAL HEAD and VIRTUAL TAIL; legend: ISSUED, READY, NOT READY]


12 Virtual Load/Store Queue (VLSQ)
Traditional LSQ: Any ready instruction is issued
Virtual LSQ: Only issue instructions residing in a virtual window
Virtual window slides down only when instruction at virtual head is issued
[Figure: Traditional Load/Store Queue (Virtual Window Size = Inf) beside Virtual Load/Store Queue (Virtual Window Size = 4), each holding MEM 0 ... MEM N between LSQ HEAD and LSQ TAIL; the virtual queue also marks VIRTUAL HEAD and VIRTUAL TAIL; legend: ISSUED, READY, NOT READY]

13 Virtual Load/Store Queue (VLSQ)
Traditional LSQ: Any ready instruction is issued
Virtual LSQ: Only issue instructions residing in a virtual window
Virtual window slides down only when instruction at virtual head is issued
[Figure: same two queues as the previous slide, with MEM 3 drawn apart from the Virtual Load/Store Queue; Virtual Window Size = Inf vs. Virtual Window Size = 4; legend: ISSUED, READY, NOT READY]
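
The issue rule described on these slides can be sketched as a small model. This is a minimal Python sketch rather than the hardware or simulator implementation: the names `MemOp`, `select_issuable`, and `advance_virtual_head` are hypothetical, a window size of `None` stands in for Inf, and sliding the head past every consecutive issued entry in one step is an assumption.

```python
from dataclasses import dataclass


@dataclass
class MemOp:
    seq: int              # position in the LSQ (program order)
    ready: bool = False   # operands available
    issued: bool = False


def select_issuable(lsq, virtual_head, window_size):
    """Ready, unissued entries inside the virtual window.
    window_size=None models the traditional LSQ (window = Inf)."""
    if window_size is None:
        window = lsq[virtual_head:]
    else:
        window = lsq[virtual_head:virtual_head + window_size]
    return [op for op in window if op.ready and not op.issued]


def advance_virtual_head(lsq, virtual_head):
    """Slide the window only while the entry at the virtual head has issued."""
    while virtual_head < len(lsq) and lsq[virtual_head].issued:
        virtual_head += 1
    return virtual_head


lsq = [MemOp(i) for i in range(8)]
for i in (1, 3, 6):
    lsq[i].ready = True

print([op.seq for op in select_issuable(lsq, 0, None)])  # [1, 3, 6]
print([op.seq for op in select_issuable(lsq, 0, 4)])     # [1, 3]
```

With a window of 4, MEM 6 is ready but has to wait, and the window does not move until MEM 0 at the virtual head issues.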

14 VLSQs: Replay Trap Stats
↑ OoO Aggressiveness (ROB from 80 to 512 entries)
–5X increase in trap frequency
  –VLSQs reduce trap frequency by factors of 2-30
–25-60% of total execution time spent in traps
  –VLSQs reduce total time handling traps by 10-40%
Direct correlation between memory ordering and replay traps
[Figure: Replay Traps / 1000 Instructions, and Replay Trap Penalty, vs. VLSQ size (up to Inf), for ROB-80, ROB-128, ROB-256, ROB-512]

15 VLSQs: DL1 Cache Stats
↑ OoO Aggressiveness (ROB from 80 to 512 entries)
–55% increase in L1 cache accesses
  –VLSQs reduce cache accesses by up to 55%
–15% increase in L1 cache misses
  –VLSQs reduce cache misses by up to 10%
Direct correlation between memory ordering and cache accesses
[Figure: Normalized Accesses, and Normalized Misses, vs. VLSQ size (up to Inf), for ROB-80, ROB-128, ROB-256, ROB-512]

16 VLSQ Performance
Applications show three different behaviors
–Group I: Performance same – non-memory intensive apps
–Group II: Performance loss – memory intensive apps
–Group III: Performance benefit – alleviating negative effects
VLSQ of size 16 or 32 is ideal across all apps
[Figure: CPI (MEMORY, ALU, OTHER) vs. VLSQ size (up to Inf) at ROB-512, for GROUP I, GROUP II, GROUP III]

17 Power Savings with VLSQs
Reducing Replay Traps
–5-60% power savings in fetch/map/exec hardware
Reducing Cache Accesses and Misses
–5-65% savings in L1 data cache
Savings of 25-30% using VLSQs of 16 or 32
[Figure: Execution Units (Normalized to Inf), and L1 Cache (Normalized to Inf), for VLSQ 1, VLSQ 4, VLSQ 8, VLSQ 16, VLSQ 32, VLSQ 64 at ROB 80, ROB 128, ROB 256, ROB 512]

18 Windowing of Load/Store Queue
Static Mechanism (This Study):
–Statically set the size of the virtual window
–Drawback: Memory ILP is lost during execution phases where negative effects do not exist
Dynamic Mechanism (Future Work):
–Intuition that negative effects do not always exist
–Dynamically vary virtual window size based on application execution behavior
  –Virtual window initially infinite
  –Vary window size based on certain thresholds
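
The dynamic mechanism is listed as future work and the slide does not define it. Purely as an illustration of the idea, here is a hypothetical Python sketch of a threshold-based controller. The monitored metric (replay traps per 1000 instructions over an interval), the thresholds `high` and `low`, the halving and doubling policy, and the bounds are all assumptions made for this example, not values from the paper.

```python
INF = None  # unbounded virtual window, i.e. traditional LSQ behavior


def next_window_size(current, traps_per_kilo_instr,
                     high=2.0, low=0.5, max_finite=64, min_size=1):
    """Pick the virtual window size for the next interval.
    All parameter values are illustrative placeholders."""
    if traps_per_kilo_instr > high:
        # Negative effects observed: restrict reordering.
        if current is INF:
            return max_finite
        return max(min_size, current // 2)
    if traps_per_kilo_instr < low:
        # Negative effects absent: give memory ILP back.
        if current is INF or current >= max_finite:
            return INF
        return current * 2
    return current


size = INF  # virtual window initially infinite, as on the slide
for rate in (5.0, 5.0, 1.0, 0.1, 0.1):
    size = next_window_size(size, rate)
    print(size)
```

Starting from an infinite window, this loop prints 64, 32, 32, 64, None: the window tightens while the trap rate is high and relaxes again once it drops.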

19 Summary
In This Paper:
–Problem: Increase in replay traps and cache misses
–Reason: Reordering of memory instructions
–Solution: Virtual Load/Store Queues (VLSQs)
Points To Take Home:
–Mechanism to improve performance causes degradation in the memory subsystem
–OoO cores shouldn’t always be on full throttle
–Because… at times we’ll NEED to tug on the reins

20 BACKUP SLIDES
THANK YOU!!!!

21 Agenda
Motivation: Why is this study important?
Paper Contributions
–The Problem
–The Reason
Background
Virtual Load Store Queues (VLSQs)
A Limit Study Using VLSQs
Summary

22 Background – Replay Traps
Replay traps are hardware enforced to
–Force accesses to a particular memory location in order
  –Ensure CORRECT execution
  –Ensure multi-processor memory consistency
–Handle different sized accesses to same address
Replay traps are NOT related to OS trap events, i.e. no handler support is needed
Recovering from a replay trap
–Similar to handling branch mispredicts
–Pipeline is flushed and execution restarts from the instruction that caused the replay trap

23 OoO Hardware – Background
Reorder Buffer (ROB), Issue Queues (Integer or Floating Point), and Load/Store Queues

24 The Problem – ↑ L1 Cache Misses
Increasing ROB size from 80 to 512
–5–40% increase in L1 cache misses when compared to ROB-80
[Figure: series for ROB 128, ROB 256, ROB 512]

25 The Problem – ↑ Replay Traps
Increasing ROB size from 80 to 512
–10–60% increase in replay trap overhead
[Figure: series for ROB 80, ROB 128, ROB 256, ROB 512]

26 VLSQ Performance
[Figure: three panels, each with ROB-80, ROB-128, ROB-256, ROB-512]

27 Replay Trap Distribution
[Figure legend: Load-Store, Wrong-Size, Load-Load, Load-Miss Load]

