
1 CS 7810 Lecture 21 Threaded Multiple Path Execution S. Wallace, B. Calder, D. Tullsen Proceedings of ISCA-25, June 1998

2 Leveraging SMT
Recall branch fan-out from “Limits of ILP”
Future processors will likely have no shortage of idle thread contexts
Spawned threads are parallel, but have dependences on earlier instructions: registers, uncommitted stores, data cache values
SMT may be an ideal candidate, as its threads share the same set of resources

3 SMT vs. CMP
A multi-threaded workload (on an SMT) is more tolerant of branch mispredictions – TME makes most sense when there is a shortage of threads
Power overheads are enormous – on an SMT, we may not have the option of executing speculative threads on low-power pipelines
What about energy? Is CMP a better candidate?

4 Renaming Overview
[Figure: rename example – r1 maps to p1; across successive branches, a later write renames r1 to p5, with mappings such as p3 recorded in the checkpoints]
Every branch causes a checkpoint of the mappings, so we can recover quickly on a mispredict
Each thread in the SMT can have 8 checkpoints
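The checkpoint-and-recover scheme above can be sketched in a few lines. This is an illustrative model, not the paper's hardware: class and method names are invented, and the 8-checkpoint limit comes from the slide.

```python
# Hypothetical sketch of per-branch rename-map checkpointing.
class RenameMap:
    """Maps architectural registers to physical registers; a snapshot is
    taken at every branch so a mispredict can restore it in one step."""

    MAX_CHECKPOINTS = 8  # per-thread limit from the slide

    def __init__(self):
        # Start with a simple identity-style mapping, r0->p0, r1->p1, ...
        self.table = {f"r{i}": f"p{i}" for i in range(32)}
        self.checkpoints = []  # stack of (branch_id, saved table)

    def rename(self, arch_reg, new_phys_reg):
        self.table[arch_reg] = new_phys_reg

    def checkpoint(self, branch_id):
        if len(self.checkpoints) >= self.MAX_CHECKPOINTS:
            raise RuntimeError("out of checkpoints; stall until a branch resolves")
        self.checkpoints.append((branch_id, dict(self.table)))

    def recover(self, branch_id):
        # Roll back to the snapshot taken at the mispredicted branch,
        # discarding any younger checkpoints along the way.
        while self.checkpoints:
            bid, saved = self.checkpoints.pop()
            if bid == branch_id:
                self.table = saved
                return
        raise KeyError(branch_id)

m = RenameMap()
m.checkpoint(branch_id=1)   # snapshot at the branch
m.rename("r1", "p5")        # speculative rename past the branch
m.recover(branch_id=1)      # mispredict: restore the old mapping
assert m.table["r1"] == "p1"
```

Restoring a full snapshot is what makes recovery fast: no walking back individual renames.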

5 Threaded Multi-Path Execution
Key elements in TME:
Identifying low-confidence branches
Efficient thread spawning
Efficient recovery on branch resolution
Fetch priorities for each thread on the SMT

6 Path Selection
Only the primary path can spawn threads (prevents an exponential increase in the number of threads)
For each branch-predictor entry, keep a count of successive correct predictions (reset on a mispredict) – if the counter is below a threshold, the branch is low-confidence
Note that a small counter size is more selective in picking low-confidence branches
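The correct-prediction counter above can be sketched as follows. The threshold and counter width are illustrative choices, not values from the paper:

```python
# Minimal sketch of a branch-confidence counter: count successive correct
# predictions per branch, reset on a mispredict, and treat anything below
# a threshold as low-confidence (a candidate for spawning the other path).
class ConfidenceEstimator:
    def __init__(self, threshold=4, max_count=15):
        self.threshold = threshold
        self.max_count = max_count   # saturating counter width (illustrative)
        self.counters = {}           # keyed by branch PC

    def update(self, pc, prediction_was_correct):
        if prediction_was_correct:
            c = self.counters.get(pc, 0)
            self.counters[pc] = min(c + 1, self.max_count)
        else:
            self.counters[pc] = 0    # reset on mispredict

    def low_confidence(self, pc):
        return self.counters.get(pc, 0) < self.threshold

est = ConfidenceEstimator(threshold=4)
for _ in range(5):
    est.update(0x400, True)
assert not est.low_confidence(0x400)   # long correct streak: high confidence
est.update(0x400, False)               # one mispredict resets the counter
assert est.low_confidence(0x400)
```

A smaller counter (or higher threshold) flags more branches as low-confidence, which is the selectivity trade-off the slide mentions.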

7 Register Mappings
In SMT, each thread can read any physical register
Thread spawning requires a copy of the register mappings at that branch
A copy involves a transfer of 32 x 9 bits – the new thread cannot begin renaming until this copy completes – the copy may also hold up the primary thread if map-table read ports are scarce
Alternatively, every new mapping can be placed on a bus, and idle threads can snoop it to keep pace
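The bus-snooping alternative can be sketched as below: the primary broadcasts each new rename, and idle contexts update a shadow copy of the map so no bulk 32-entry transfer is needed at spawn time. The class names and the broadcast interface are invented for illustration.

```python
# Sketch of a "mapping bus": idle contexts snoop every rename the primary
# makes, so their shadow rename map is already current when they spawn.
class MappingBus:
    def __init__(self):
        self.snoopers = []

    def attach(self, context):
        self.snoopers.append(context)

    def broadcast(self, arch_reg, phys_reg):
        # Every attached idle context observes the new mapping.
        for ctx in self.snoopers:
            ctx.shadow_map[arch_reg] = phys_reg

class IdleContext:
    def __init__(self, bus, initial_map):
        self.shadow_map = dict(initial_map)  # one-time sync, then snoop
        bus.attach(self)

primary_map = {"r1": "p1", "r2": "p2"}
bus = MappingBus()
idle = IdleContext(bus, primary_map)

# The primary renames r1; the idle context keeps pace via the bus.
primary_map["r1"] = "p5"
bus.broadcast("r1", "p5")
assert idle.shadow_map["r1"] == "p5"
```

This trades a 32 x 9-bit burst copy at spawn time for a steady per-rename broadcast.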

8 Spawning Algorithm

9 When threads are idle, they keep pace and spawn a thread as soon as a low-confidence branch is encountered
When a thread context becomes free and a low-confidence checkpoint already exists, the new context synchronizes its mappings with the primary context and executes the primary path, while the old primary context executes the alternate path after reinstating the checkpoint
If a newly idle thread has a low-confidence checkpoint, it starts executing the alternate path
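The two main spawning cases above can be sketched as a role assignment between contexts. The `Context` class and function names are invented for illustration; the path-swap in the second case follows the slide's description.

```python
# Hedged sketch of the spawning cases described above.
class Context:
    def __init__(self, name, path=None):
        self.name = name
        self.path = path  # which path this context is currently executing

def spawn_at_branch(idle, alternate_path):
    # Case 1: an idle context that has been keeping pace takes the
    # alternate path as soon as the low-confidence branch is fetched.
    idle.path = alternate_path
    return idle

def spawn_from_checkpoint(primary, freed, checkpoint_alt_path):
    # Case 2: a context freed after the branch synchronizes its mappings
    # with the primary and continues down the primary path, while the old
    # primary reinstates the checkpoint and executes the alternate path.
    freed.path = primary.path
    primary.path = checkpoint_alt_path
    return freed, primary

primary = Context("T0", path="predicted")
freed = Context("T1")
new_primary, alt = spawn_from_checkpoint(primary, freed, "alternate")
assert new_primary.path == "predicted" and alt.path == "alternate"
```

The role swap in case 2 is the subtle part: the freshly freed context inherits the primary path, because the old primary already holds the checkpoint needed to start the alternate path.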

10 Introduced Complexity
Book-keeping to manage checkpoint locations – every branch has to track the location of its checkpoint
Who frees a register value?
What about memory dependences?
Loads can ignore stores that are not predecessors
Maintain an array of bits to represent the path taken (each basic block corresponds to a bit in the array)
Check for memory dependences only if the store’s path is a subset of the load’s path
[Figure: r1 renamed to p5, p7, and p8 along different paths]
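The path-subset test above reduces to a bitwise check. The block encoding below is illustrative; only the subset rule itself comes from the slide:

```python
# Sketch of the path-bit-array dependence check: each basic block on a path
# sets one bit; a load must honor a store only when the store's path bits
# are a subset of the load's path bits (the store is a true predecessor).
def is_predecessor(store_path_bits, load_path_bits):
    # Subset test: no bit set for the store that is missing for the load.
    return store_path_bits & ~load_path_bits == 0

store_path = 0b0011   # store executed along basic blocks 0 and 1
load_same  = 0b0111   # load's path includes blocks 0,1,2 -> check this store
load_other = 0b0101   # load's path skipped block 1 -> store is on another path

assert is_predecessor(store_path, load_same)
assert not is_predecessor(store_path, load_other)
```

Stores on a non-predecessor path can be ignored entirely, which keeps alternate-path loads from seeing the wrong memory state.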

11 Processor Parameters
Eight-wide processor with up to eight contexts; each context has eight checkpoints
32-entry issue queues, 4Kb gshare branch predictor, 7-cycle misprediction penalty, memory latency of 62 cycles
Fetch policy ICOUNT 2.8: the first thread can bring in up to 8 instrs and the second thread fills in unused slots; occupancy in the front-end determines priority
Focus on branch-limited programs: compress (20%), gcc (18%), go (30%), li (6%)

12 Results: Spare Contexts

13 Results: Bus Latency

14 Results: Branch Confidence

15 Results: Path Selection

16 Results: Fetch Policy

17 Results: Misprediction Penalty

18 Conclusions
Too much complexity/power overhead for too little benefit?
Benefits may be higher for deeper pipelines, larger windows (this paper evaluates 8 windows of 48 instrs; would 2 x 192 yield better results?), and longer memory latencies
There is room for improvement with better branch-confidence metrics
CMPs will incur greater cost during thread spawning, but may be more power-efficient


