
1 Speculative Parallelization in Decoupled Look-ahead
Alok Garg, Raj Parihar, and Michael C. Huang
Dept. of Electrical & Computer Engineering, University of Rochester, Rochester, NY

2 Motivation
- Single-thread performance still an important design goal
- Branch mispredictions and cache misses are still important performance hurdles
- One effective approach: decoupled look-ahead
- Look-ahead thread can often become the new bottleneck
- Speculative parallelization aptly suited for its acceleration
  - More parallelism opportunities: look-ahead thread does not contain all dependences
  - Simple architectural support: look-ahead is not correctness critical

3 Outline
- Motivation
- Baseline decoupled look-ahead
- Speculative parallelization in look-ahead
- Experimental analysis
- Summary

4 Baseline Decoupled Look-ahead
- A binary parser generates the skeleton from the original program
- The skeleton runs on a separate core and:
  - Maintains its own memory image in the local L1; no write-back to the shared L2
  - Sends all branch outcomes to the main thread through a FIFO queue and also helps prefetching (a queue sketch follows below)
*A. Garg and M. Huang. A Performance-Correctness Explicitly Decoupled Architecture. In Proc. Int'l Symp. on Microarch., November 2008.
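
As a rough illustration of the branch-outcome FIFO between the look-ahead core and the main core, the following C sketch shows a push/pop pair. The queue depth, names such as boq_push, and the stall-when-full behavior are illustrative assumptions, not the paper's implementation.

#include <stdbool.h>
#include <stdint.h>

#define BOQ_SIZE 256                       /* assumed queue depth */

typedef struct {
    uint64_t branch_pc;                    /* PC of the branch in the skeleton */
    bool     taken;                        /* outcome computed by look-ahead */
} boq_entry;

static boq_entry boq[BOQ_SIZE];
static unsigned  boq_head, boq_tail, boq_count;

/* Look-ahead core: enqueue an outcome; returning false when the queue is
 * full models one way run-away look-ahead could be throttled. */
bool boq_push(uint64_t pc, bool taken) {
    if (boq_count == BOQ_SIZE) return false;   /* look-ahead stalls */
    boq[boq_tail] = (boq_entry){ pc, taken };
    boq_tail = (boq_tail + 1) % BOQ_SIZE;
    boq_count++;
    return true;
}

/* Main core: consume the oldest outcome as a near-oracle branch prediction;
 * fall back to the normal predictor when the queue is empty. */
bool boq_pop(boq_entry *out) {
    if (boq_count == 0) return false;
    *out = boq[boq_head];
    boq_head = (boq_head + 1) % BOQ_SIZE;
    boq_count--;
    return true;
}

If the queue fills, the look-ahead thread stalls, which is one way the natural throttling mentioned on the next slide can arise.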

5 Practical Advantages of Decoupled Look-ahead
- The look-ahead thread is a single, self-reliant agent
  - No need for quick spawning and register communication support
  - Little management overhead on the main thread
  - Easier for run-time control to disable
- Natural throttling mechanism to prevent run-away prefetching
- Look-ahead thread size comparable to an aggregation of short helper threads

6 Look-ahead: A New Bottleneck
- Comparing four systems: baseline, decoupled look-ahead, ideal, look-ahead alone
- Application categories: bottlenecks removed; speed of look-ahead not the problem; look-ahead is the new bottleneck

7 Outline
- Motivation
- Baseline decoupled look-ahead
- Speculative parallelization in look-ahead
- Experimental analysis
- Summary

8 Unique Opportunities
- Skeleton code offers more parallelism
  - Certain dependences are removed during slicing for the skeleton
- Look-ahead is inherently error-tolerant
  - Can ignore dependence violations
  - Little to no support needed, unlike in conventional TLS
(Slide shows assembly code from equake)

9 Software Support
- Dependence analysis: profile-guided, coarse-grain at the basic block level
- Spawn and target points: basic blocks with a consistent dependence distance of more than a threshold D_MIN (a selection sketch follows below)
  - The spawned thread executes from the target point
- Loop-level parallelism is also exploited
(Figures: parallelism at the basic block level; available parallelism for a 2-core/context system, with spawn and target points marked over time)
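
To make the spawn/target selection concrete, here is a minimal C sketch of a profile-guided filter, assuming the profile records, for each basic block, how many dynamic instances had no upstream dependence within the last D_MIN basic blocks. The MIN_COVERAGE threshold and all field names are illustrative assumptions.

#include <stdbool.h>

#define D_MIN        15     /* minimum dependence distance, in basic blocks */
#define MIN_COVERAGE 0.95   /* assumed fraction of instances that must qualify */

typedef struct {
    int  bb_id;             /* basic block id in the skeleton binary */
    long instances;         /* dynamic instances seen in the profile */
    long far_instances;     /* instances whose nearest dependence is > D_MIN blocks away */
} bb_profile;

/* A block qualifies as a target point if nearly all of its dynamic instances
 * had no dependence within the last D_MIN basic blocks; a spawn point is then
 * placed D_MIN or more blocks upstream of it. */
bool is_target_point(const bb_profile *p) {
    if (p->instances == 0) return false;
    return (double)p->far_instances / (double)p->instances >= MIN_COVERAGE;
}

Here "consistent" is modeled as a simple coverage threshold; the actual selection criterion in the paper may differ.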

10 Parallelism in Look-ahead Binary
- Available theoretical parallelism for a 2-core/context system; D_MIN = 15 basic blocks
- Parallelism potential with stable and consistent target and spawn points

11 Hardware and Runtime Support
- Thread spawning and merging: not too different from regular thread spawning, except that
  - The spawned thread shares the same register and memory state
  - The spawning thread terminates at the target PC
- Value communication
  - Register-based: handled naturally through shared registers in SMT
  - Memory-based: can be supported at different levels, e.g., partial versioning in the cache at line level (a line-versioning sketch follows below)
(Figure: spawning of a new look-ahead thread)
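
The following C sketch gives one plausible reading of partial versioning at cache-line granularity: a store from one look-ahead thread allocates a private copy of the line instead of overwriting the other thread's copy, and if no way is free the store is simply dropped, which look-ahead tolerates because it is not correctness-critical. The structure, fields, and replacement policy are assumptions, not the paper's design.

#include <stdint.h>
#include <string.h>

#define LINE_BYTES 64

typedef struct {
    uint64_t tag;                /* line address tag */
    uint8_t  data[LINE_BYTES];
    int      version;            /* 0 = primary look-ahead, 1 = spawned thread */
    int      valid;
} cache_line;

/* Store from a look-ahead thread into one set of the local L1.
 * Each thread writes only lines carrying its own version number;
 * offset is assumed to be < LINE_BYTES. */
void versioned_store(cache_line *set, int ways, uint64_t tag,
                     int thread_version, int offset, uint8_t value) {
    cache_line *other = NULL;
    for (int w = 0; w < ways; w++) {
        cache_line *l = &set[w];
        if (l->valid && l->tag == tag) {
            if (l->version == thread_version) {
                l->data[offset] = value;          /* own version: write in place */
                return;
            }
            other = l;                            /* the other thread's copy */
        }
    }
    for (int w = 0; w < ways; w++) {              /* allocate a private version */
        cache_line *l = &set[w];
        if (!l->valid) {
            if (other) memcpy(l->data, other->data, LINE_BYTES);
            else       memset(l->data, 0, LINE_BYTES);  /* miss fill omitted */
            l->valid = 1;
            l->tag = tag;
            l->version = thread_version;
            l->data[offset] = value;
            return;
        }
    }
    /* No free way: drop the store; look-ahead is not correctness-critical. */
}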

12 Outline
- Motivation
- Baseline decoupled look-ahead
- Speculative parallelization in look-ahead
- Experimental analysis
- Summary

13 Experimental Setup
- Program/binary analysis tool: based on ALTO
- Simulator: based on a heavily modified SimpleScalar
  - SMT, look-ahead, and speculative parallelization support
  - True execution-driven simulation (faithful value modeling)
(Table: microarchitectural and cache configurations)

14 Speedup of Speculative Parallel Decoupled Look-ahead
- 14 applications in which look-ahead is the bottleneck
- Speedup of look-ahead systems over single thread:
  - Decoupled look-ahead over single-thread baseline: 1.61x
  - Speculative look-ahead over single-thread baseline: 1.81x
  - Speculative look-ahead over decoupled look-ahead: 1.13x (note 1.61 × 1.13 ≈ 1.81, so the three figures are mutually consistent)

15 Speculative Parallelization in Look-ahead vs. Main Thread
- The skeleton provides more opportunities for parallelization
- Speedup of speculative parallelization:
  - Speculative look-ahead over the decoupled look-ahead baseline: 1.13x
  - Speculative main thread over the single-thread baseline: 1.07x
(Figure: IPC of the baseline single thread, baseline TLS, decoupled look-ahead, and speculative look-ahead across two cores/contexts)

16 Flexibility in Hardware Design
- Comparison of the regular (partial versioning) cache support with two other alternatives:
  - No cache versioning support
  - Dependence violation detection and squash

17 Other Details in the Paper
- The effect of spawning a look-ahead thread on preserving the lead of the overall look-ahead system
- Technique to avoid spurious spawns in the dispatch stage, which could be subject to branch misprediction
- Technique to mitigate the damage of runaway spawns
- Impact of speculative parallelization on overall recoveries
- Modifications to the branch queue in a multiple look-ahead system

18 Outline
- Motivation
- Baseline decoupled look-ahead
- Speculative parallelization in look-ahead
- Experimental analysis
- Summary

19 Summary
- Decoupled look-ahead can significantly improve single-thread performance, but the look-ahead thread often becomes the new bottleneck
- The look-ahead thread lends itself to TLS acceleration
  - Skeleton construction removes dependences and increases parallelism
  - Hardware design is flexible and can be a simple extension of SMT
- A straightforward implementation of TLS look-ahead achieves an average 1.13x (up to 1.39x) speedup

20 Speculative Parallelization in Decoupled Look-ahead
Alok Garg, Raj Parihar, and Michael C. Huang
Dept. of Electrical & Computer Engineering, University of Rochester, Rochester, NY
Backup Slides
http://www.ece.rochester.edu/projects/acal/

21 References (Partial)
- J. Dundas and T. Mudge. Improving Data Cache Performance by Pre-Executing Instructions Under a Cache Miss. In Proc. Int'l Conf. on Supercomputing, pages 68–75, July 1997.
- O. Mutlu, J. Stark, C. Wilkerson, and Y. Patt. Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors. In Proc. Int'l Symp. on High-Perf. Comp. Arch., pages 129–140, February 2003.
- J. Collins, H. Wang, D. Tullsen, C. Hughes, Y. Lee, D. Lavery, and J. Shen. Speculative Precomputation: Long-range Prefetching of Delinquent Loads. In Proc. Int'l Symp. on Comp. Arch., pages 14–25, June 2001.
- C. Zilles and G. Sohi. Execution-Based Prediction Using Speculative Slices. In Proc. Int'l Symp. on Comp. Arch., pages 2–13, June 2001.
- A. Roth and G. Sohi. Speculative Data-Driven Multithreading. In Proc. Int'l Symp. on High-Perf. Comp. Arch., pages 37–48, January 2001.
- Z. Purser, K. Sundaramoorthy, and E. Rotenberg. A Study of Slipstream Processors. In Proc. Int'l Symp. on Microarch., pages 269–280, December 2000.
- G. Sohi, S. Breach, and T. Vijaykumar. Multiscalar Processors. In Proc. Int'l Symp. on Comp. Arch., pages 414–425, June 1995.

22 References (cont.)
- H. Zhou. Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window. In Proc. Int'l Conf. on Parallel Arch. and Compilation Techniques, pages 231–242, September 2005.
- A. Garg and M. Huang. A Performance-Correctness Explicitly Decoupled Architecture. In Proc. Int'l Symp. on Microarch., November 2008.
- C. Zilles and G. Sohi. Master/Slave Speculative Parallelization. In Proc. Int'l Symp. on Microarch., pages 85–96, November 2002.
- D. Burger and T. Austin. The SimpleScalar Tool Set, Version 2.0. Technical Report 1342, Computer Sciences Department, University of Wisconsin-Madison, June 1997.
- J. Renau, K. Strauss, L. Ceze, W. Liu, S. Sarangi, J. Tuck, and J. Torrellas. Energy-Efficient Thread-Level Speculation on a CMP. IEEE Micro, 26(1):80–91, January/February 2006.
- P. Xekalakis, N. Ioannou, and M. Cintra. Combining Thread Level Speculation, Helper Threads, and Runahead Execution. In Proc. Int'l Conf. on Supercomputing, pages 410–420, June 2009.
- M. Cintra, J. Martinez, and J. Torrellas. Architectural Support for Scalable Speculative Parallelization in Shared-Memory Multiprocessors. In Proc. Int'l Symp. on Comp. Arch., pages 13–24, June 2000.

23 Backup Slides (for reference only)

24 Register Renaming Support

25 Modified Banked Branch Queue

26 Speedup of All Applications

27 Impact on Overall Recoveries

28 Partial Recovery in Look-ahead
- When a recovery happens in the primary look-ahead thread, do we kill the secondary look-ahead thread or not?
- If we don't kill it, we gain around 1% performance improvement
- After a recovery in the main thread, the secondary thread often survives (on the order of 1000 instructions)
- Spawning a secondary look-ahead thread helps preserve the slip of the overall look-ahead system

29 Quality of Thread Spawning
- Successful vs. run-away spawns (run-away spawns are killed after a certain time)
- Impact of a lazy spawning policy (sketched below):
  - Wait for a few cycles when a spawning opportunity arrives
  - Avoids spurious spawns; saves some wrong-path execution
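
As a rough illustration of the lazy spawning policy, this C sketch arms a short countdown when a spawn point is seen at dispatch and cancels it if the instruction is squashed by a branch misprediction before the countdown expires. The delay value and all names are assumptions for illustration only.

#include <stdbool.h>

#define SPAWN_DELAY_CYCLES 4        /* assumed "few cycles" of laziness */

typedef struct {
    bool          armed;
    int           countdown;
    unsigned long target_pc;
} lazy_spawn_t;

/* A spawn point reaches dispatch: arm a countdown instead of spawning now. */
void spawn_seen_at_dispatch(lazy_spawn_t *s, unsigned long target_pc) {
    s->armed     = true;
    s->countdown = SPAWN_DELAY_CYCLES;
    s->target_pc = target_pc;
}

/* The spawn-point instruction was squashed (e.g., by a branch misprediction):
 * cancel the pending spawn and avoid a spurious thread. */
void spawn_point_squashed(lazy_spawn_t *s) {
    s->armed = false;
}

/* Called once per cycle; returns true when the spawn should finally fire. */
bool lazy_spawn_tick(lazy_spawn_t *s) {
    if (!s->armed) return false;
    if (--s->countdown > 0) return false;
    s->armed = false;
    return true;                    /* spawn a look-ahead thread at s->target_pc */
}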

