
1 Federation: Repurposing Scalar Cores for Out-of-Order Instruction Issue
David Tarjan*, Michael Boyer, and Kevin Skadron*
University of Virginia, Department of Computer Science
* Currently on internship/sabbatical at NVIDIA Research

2 Motivation
[Figure: Homogeneous, Heterogeneous, and Adaptive (Federation) chip organizations, each shown with an L2. Legend: multithreaded scalar IO core, 2-way OO core.]

3 Basic Insights
- A multithreaded in-order core has many registers, which can be reused for a reorder buffer or active list
- If cores are small, single-cycle communication between neighbors is feasible
- Prior work on making large OOO cores feasible can be applied at the low end to make low-cost OOO possible

4 In-order & Out-of-order Pipelines
[Figure: the in-order pipeline (Fetch, Decode, Execute, Mem, Writeback) alongside the out-of-order pipeline, which has the same stages plus Bpred, Allocate, Rename, Issue, and Commit.]

5 Issue Queue Example
[Figure: issue queue entries 1 to 5, each with Ready Bits, Subscriber Slot 1, and Subscriber Slot 2.]
Huang et al., Energy-Efficient Hybrid Wakeup Logic, ISLPED 2002
Sassone et al., Matrix Scheduler Reloaded, ISCA 2007

6 Simplified Load-Store Queue
- Memory Alias Table (MAT)
- No store forwarding
- No conservative waiting on stores
- Only detect memory order violations after they have occurred, and flush the pipeline when the offending instruction commits
Amir Roth, Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load Optimization, ISCA 2005

7 MAT Example
Instructions: st 0x13, r5; ld r1, 0x13
MAT index:   0 1 2 3 4 5 6 7
MAT counter: 0 0 0 0 0 0 0 0

8 MAT Example
Instructions: st 0x13, r5; ld r1, 0x13 (EXE)
MAT index:   0 1 2 3 4 5 6 7
MAT counter: 0 0 0 1 0 0 0 0
ld executes and increments counter

9 MAT Example
Instructions: st 0x13, r5 (COM); ld r1, 0x13
MAT index:   0 1 2 3  4 5 6 7
MAT counter: 0 0 0 1! 0 0 0 0
st commits and sets flag

10 MAT Example
Instructions: ld r1, 0x13 (COM)
MAT index:   0 1 2 3  4 5 6 7
MAT counter: 0 0 0 1! 0 0 0 0
Flush: ld commits, sees flag, and flushes pipeline

11 MAT Example
Instructions: ld r1, 0x13
MAT index:   0 1 2 3 4 5 6 7
MAT counter: 0 0 0 0 0 0 0 0
MAT is reset and execution resumes
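
The MAT walkthrough in slides 7 to 11 can be approximated in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' hardware design: the class name `MemoryAliasTable`, the modulo index function (chosen because 0x13 mod 8 = 3 matches the entry used in the slides), and the counter decrement when a load commits without a flag are all assumptions added for illustration.

```python
class MemoryAliasTable:
    """Toy model of the MAT from the slides (names and hashing are assumptions)."""

    def __init__(self, entries=8):
        self.entries = entries
        self.reset()

    def reset(self):
        # One counter and one violation flag per entry.
        self.count = [0] * self.entries
        self.flag = [False] * self.entries

    def _index(self, addr):
        # Assumed hash: low-order bits of the address (0x13 % 8 == 3).
        return addr % self.entries

    def load_execute(self, addr):
        # A load that executes (possibly early) increments its counter.
        self.count[self._index(addr)] += 1

    def store_commit(self, addr):
        # A committing store that finds a nonzero counter sets the flag:
        # some load to an aliasing address has already executed.
        i = self._index(addr)
        if self.count[i] > 0:
            self.flag[i] = True

    def load_commit(self, addr):
        # Returns True if the pipeline must be flushed.
        i = self._index(addr)
        if self.flag[i]:
            self.reset()  # MAT is reset and execution resumes
            return True
        self.count[i] -= 1  # assumption: a clean load releases its count
        return False


mat = MemoryAliasTable()
mat.load_execute(0x13)          # slide 8
mat.store_commit(0x13)          # slide 9
flushed = mat.load_commit(0x13) # slides 10 and 11
```

Because several addresses share one entry, this model can also flag loads that never conflicted with the store, which corresponds to the false positives mentioned on the Commit slide.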

12 Performance Impact

13 Performance

14 Energy Efficiency

15 Area Efficiency

16 Conclusions
- Two in-order cores can be federated at run-time to form a 2-way OO core
- Almost doubling IPC of throughput core is possible with very little extra hardware
- Don’t want traditional OO structures because their performance comes at too high a price
- Best combined area- and energy-efficiency

17 Q & A

18 Backup

19 Core Fusion Data Figure from Ipek et al., “Core Fusion: Accommodating Software Diversity in Chip Multiprocessors”, ISCA 2007

20 Overall Results
- Scalar in-order core: 8KB I/D caches, 256KB L2
- Base 2-way core: 16KB I- and D-caches, 256KB L2, 32-entry ROB, 16-entry issue queue, 16-entry LSQ, bimodal bpred
- 4-way core: 32KB I/D caches, 2MB L2, 128-entry ROB, 32-entry IQ and LSQ, tournament bpred

21 Branch Prediction
- Use only a Next Line and Set (NLS) predictor, a bimodal predictor, and a Return Address Stack (RAS)
- NLS is OK if the instruction working set is not larger than the I$
- A small bimodal predictor is OK for a small-window processor
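
For readers unfamiliar with the bimodal predictor named on this slide, here is a minimal sketch of the textbook scheme: a table of 2-bit saturating counters indexed by the branch PC. The table size, the index function, and the initial counter value are assumptions chosen for illustration; the slides do not give the predictor's parameters.

```python
class BimodalPredictor:
    """Textbook bimodal predictor: 2-bit saturating counters indexed by PC."""

    def __init__(self, entries=64):
        self.entries = entries              # assumed table size
        self.counters = [1] * entries       # start weakly not-taken (assumption)

    def _index(self, pc):
        # Assumed index: drop the 2 byte-offset bits, then take the low bits.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        # Counter values 2 and 3 predict taken; 0 and 1 predict not-taken.
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        # Saturating increment on taken, saturating decrement on not-taken.
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


bp = BimodalPredictor()
bp.update(0x40, taken=True)
prediction = bp.predict(0x40)
```

The 2-bit hysteresis means a single not-taken outcome in a mostly-taken loop branch does not flip the prediction.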

22 Fetch
- Two I$’s act as one I$ of twice the size and associativity (with random replacement)
- More logic and buffers to capture two instructions
- Extra cycle to route instructions from the two I$’s to the two decoders
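
The idea of two small caches acting as one cache with twice the associativity can be sketched as follows. This is a toy model under stated assumptions: the class names, the set and way counts, and the use of a seeded `random.Random` for victim selection are all hypothetical, and timing (the extra routing cycle) is not modeled.

```python
import random


class SmallCache:
    """One core's cache: num_sets sets of `ways` tags each (toy model)."""

    def __init__(self, num_sets, ways):
        self.sets = [[None] * ways for _ in range(num_sets)]

    def lookup(self, set_idx, tag):
        return tag in self.sets[set_idx]


class FederatedCache:
    """Two SmallCaches probed together, behaving like one cache with 2x the ways."""

    def __init__(self, num_sets=4, ways_per_core=2, seed=0):
        self.num_sets = num_sets
        self.ways_per_core = ways_per_core
        self.caches = [SmallCache(num_sets, ways_per_core) for _ in range(2)]
        self.rng = random.Random(seed)  # assumed source of randomness

    @property
    def total_ways(self):
        return 2 * self.ways_per_core

    def access(self, line_addr):
        """Return True on a hit; on a miss, fill a randomly chosen way."""
        set_idx = line_addr % self.num_sets
        tag = line_addr // self.num_sets
        # Both caches are checked for the same set (in parallel in hardware).
        if any(c.lookup(set_idx, tag) for c in self.caches):
            return True
        # Random replacement across all ways of both caches.
        core = self.rng.randrange(2)
        way = self.rng.randrange(self.ways_per_core)
        self.caches[core].sets[set_idx][way] = tag
        return False


fc = FederatedCache()
first = fc.access(5)   # miss, line is filled
second = fc.access(5)  # hit in whichever cache received the line
```

Since a line may live in either core's cache, a hit in one is a hit for the merged cache, which is what gives the combined structure its doubled associativity.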

23 Decode
- Cancel the second instruction if the first turns out to be a branch
- Extra cycle to route decoded instructions to the new allocate stage

24 Allocate
- New logic and free lists to allocate ROB and IQ entries

25 Rename
- New table, since it has too many ports
- One centralized rename table, not distributed
- Has a separate table (or a field in each RAT entry) holding the IQ-slot number of each register’s producer instruction (see our new issue queue)

26 Issue
- Uses a simple lookup table as the wakeup structure, where instructions subscribe to their input instructions (explained in detail later)
- Centralized: one IQ for the two cores
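
A subscription-based wakeup table of this kind might look like the sketch below. It is an illustration only: the two ready bits and two subscriber slots per entry follow the Issue Queue Example slide, but the class and method names are hypothetical, and refusing allocation when a producer's subscriber slots are full is an assumption about how overflow could be handled, not something the slides state.

```python
class IssueQueue:
    """Toy wakeup table: consumers subscribe to the IQ slots of their producers."""

    def __init__(self, size=16, slots=2):
        self.slots = slots                                # subscriber slots per entry
        self.ready = [[True, True] for _ in range(size)]  # one ready bit per operand
        self.subs = [[] for _ in range(size)]             # (consumer slot, operand) pairs

    def allocate(self, slot, producers):
        """producers[i] is the IQ slot producing operand i, or None if it is ready.

        Returns False (assumed: the caller stalls) if a producer has no free
        subscriber slot.
        """
        pending = [(op, p) for op, p in enumerate(producers) if p is not None]
        needed = {}
        for _, p in pending:
            needed[p] = needed.get(p, 0) + 1
        if any(len(self.subs[p]) + n > self.slots for p, n in needed.items()):
            return False
        self.ready[slot] = [p is None for p in producers]
        self.subs[slot] = []
        for op, p in pending:
            self.subs[p].append((slot, op))
        return True

    def wakeup(self, producer):
        """Producer issues: set its subscribers' ready bits via a table lookup."""
        woken = []
        for slot, op in self.subs[producer]:
            self.ready[slot][op] = True
            if all(self.ready[slot]):
                woken.append(slot)
        self.subs[producer] = []
        return woken


iq = IssueQueue(size=4)
iq.allocate(0, [None, None])  # no pending inputs
iq.allocate(1, [0, None])     # waits on slot 0
iq.allocate(2, [0, 1])        # waits on slots 0 and 1
woken_by_0 = iq.wakeup(0)     # only slot 1 becomes fully ready
```

The point of the structure is that wakeup is a direct lookup of the producer's subscriber list rather than a broadcast compared against every entry.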

27 Register File
- Register file is mirrored in the two cores
- No extra copy instructions or load-balancing questions

28 Execute
- Add extra cycle for copying result to other core’s register file (like EV6)

29 Memory Access
- The two D$s are checked in parallel, each responsible for half of the merged D$’s ways
- No standard LSQ, only a Memory Alias Table (details later)
- The MAT only detects ordering violations and sends a signal to the pipeline

30 Commit
- Centralized commit, no slippage
- Recover from branch mispredictions at commit, since there are no checkpoints of the RAT on branches
- Recover from memory order violations (or false positives) reported by the MAT

