
Mixed Speculative Multithreaded Execution Models Marcelo Cintra University of Edinburgh


1 Mixed Speculative Multithreaded Execution Models Marcelo Cintra University of Edinburgh

2 Context and Motivation
• Multi-cores are here to stay and many-cores are coming
• Excellent performance for embarrassingly parallel or throughput workloads. Otherwise, ...
[Figure: Core Duo (Yonah), Core i7 (Lynnfield), SCC (2015?)]
University of Manchester

3 Context and Motivation
• Future many-cores will have many idle cores too often
  – Not enough applications
  – Not enough benefit from using more “explicit user threads”
• Our proposal: use spare cores to accelerate whatever threads are available
  – Create implicit threads to run in parallel with main explicit user threads
  – Accelerate user threads through increased coarse-grain overlap (i.e., TLP) or increased fine-grain overlap (i.e., ILP)
  – Combine previously proposed speculative multithreading techniques: thread-level speculation (TLS), helper threads (HT), run-ahead execution (RA), and multi-path execution (MP)

4 Context and Motivation
• Why is combining SM schemes a good idea?
  – No speculative multithreading scheme alone is good enough
  – Hardware support for all schemes is very similar
• Expected end-result?
  – Better performance
  – More “effective” many-core experience:
    • With 4-way speculative multithreading (i.e., 4 implicit SM threads for each explicit user thread) a 64-core “unwieldy” many-core is as “easy to handle” as a 16-core system
• How about power efficiency?
  – Speculation can be made less inefficient (we’re working on it)
  – Power can be smartly allocated (see our IPDPS’10 paper)

5 Contributions
• Introduce mixed Speculative Multithreading (SM) execution models
• Design and evaluate two combinations: TLS+HT+RA [ICS’09] and TLS+MP [HPCA’10]
• Propose a performance model able to quantify ILP and TLP benefits
• Combined approaches outperform state-of-the-art SM models:
  – TLS+HT+RA: TLS by 10.2% avg. (up to 41.2%) and RA by 18.3% avg. (up to 35.2%)
  – TLS+MP: TLS by 9.2% avg. (up to 23.2%) and MP by 28.2% avg. (up to 138%)

6 Outline
• Introduction
• Speculative multithreading models
• Combined TLS+HT+RA scheme
• Combined TLS+MP scheme
• Performance model
• Experimental setup and results
• Conclusions

7 Speculative Multithreading
• Basic idea: use idle cores/contexts to speculate on future application needs
  – TLS: speculatively execute parallel threads
  – HT/RA: speculatively perform future memory operations
  – MP: speculatively execute along multiple branch targets
• Speculative threads supported in hardware
• Compiler support not essential, but can be useful
• Hardware infrastructure is very similar

8 Thread Level Speculation
• Compiler deals with:
  – Task selection
  – Code generation
• HW deals with:
  – Different context
  – Spawn threads
  – Detecting violations
  – Replaying
  – Arbitrate commit
• Benefit: TLP/ILP
  – TLP (Overlapped Execution) + ILP (Prefetching)

9 Helper Threads
• Compiler deals with:
  – Memory ops that miss / hard-to-predict branches
  – Backward slices
• HW deals with:
  – Spawn threads
  – Different context
  – Discard when finished
• Benefit:
  – ILP (Prefetch/Warmup)

10 RunAhead Execution
• Compiler deals with:
  – Nothing
• HW deals with:
  – Different context
  – When to do RA
  – VP Memory
  – Commit/Discard
• Benefit:
  – ILP (Prefetch/Warmup)

11 MultiPath Execution
• Compiler deals with:
  – Nothing
• HW deals with:
  – Different context
  – When to do MP
  – Discard wrong path
• Benefit:
  – ILP (Branch Pred.)
[Figure labels: Main Thread, Time, Correct Paths, Wrong Paths, Branch Misp. Cost]

12 Outline
• Introduction
• Speculative multithreading models
• Combined TLS+HT+RA scheme
• Combined TLS+MP scheme
• Performance model
• Experimental setup and results
• Conclusions

13 Combining TLS, HT and RA
• Start with TLS
• Provide support to clone TLS threads and convert them to HT
• Conversion to HT means:
  – Put them in RA mode
  – Suppress squashes and do not cause additional squashes
  – Discard them when they finish
• No compiler slicing → purely HW approach

14 Intricacies to be Handled
• HT may not prefetch effectively!
• Dealing with contention
  – HT threads much faster → saturate BW
• Dealing with thread ordering
  – TLS imposes total thread order
  – HT killed → squashes TLS threads

15 Creating and Terminating HT
• Create an HT on an L2 miss that we can value-predict (VP)
  – Use a memory-address-based confidence estimator
  – VP only if confident
• Create an HT only if we have a free processor
• Only allow the most speculative thread to clone
  – Seamless integration of HT with TLS
  – BUT: if the parent is no longer the most speculative TLS thread, the HT has to be killed
• Additionally kill the HT when:
  – Parent/HT thread finishes
  – HT causes an exception
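The creation and termination rules on this slide can be sketched as a small software model. This is only an illustration under stated assumptions, not the hardware design from the talk: the class `LastValuePredictor`, the functions `should_create_ht` and `should_kill_ht`, the 2-bit saturating counter, and the threshold of 2 are all hypothetical choices made for the example. The slides only say that a last value predictor and an address-based confidence estimator are used.

```python
class LastValuePredictor:
    """Toy last-value predictor with a per-address saturating confidence
    counter. Table organization and counter width are assumptions."""

    def __init__(self, threshold=2, max_conf=3):
        self.values = {}      # address -> last value observed
        self.conf = {}        # address -> saturating confidence counter
        self.threshold = threshold
        self.max_conf = max_conf

    def update(self, addr, actual):
        # First observation: remember the value, start with no confidence.
        if addr not in self.values:
            self.values[addr] = actual
            self.conf[addr] = 0
            return
        if self.values[addr] == actual:
            # Same value as last time: gain confidence (saturating).
            self.conf[addr] = min(self.conf[addr] + 1, self.max_conf)
        else:
            # Value changed: remember the new one and reset confidence.
            self.values[addr] = actual
            self.conf[addr] = 0

    def confident(self, addr):
        return self.conf.get(addr, 0) >= self.threshold


def should_create_ht(l2_miss, addr, predictor, free_processors,
                     is_most_speculative):
    """Clone a TLS thread into a helper thread only if every condition
    listed on the slide holds."""
    return (l2_miss
            and predictor.confident(addr)      # VP only if confident
            and free_processors > 0            # need a free processor
            and is_most_speculative)           # only the most speculative clones


def should_kill_ht(parent_is_most_speculative, parent_finished,
                   ht_finished, ht_exception):
    """Kill conditions listed on the slide."""
    return (not parent_is_most_speculative
            or parent_finished
            or ht_finished
            or ht_exception)


if __name__ == "__main__":
    lvp = LastValuePredictor()
    for _ in range(3):
        lvp.update(0x40, 7)   # the same value is seen repeatedly
    print(should_create_ht(True, 0x40, lvp, free_processors=1,
                           is_most_speculative=True))
```

Restricting cloning to the most speculative thread keeps the helper thread at the tail of the TLS order, which is why losing that position appears as a kill condition.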

16 Outline
• Introduction
• Speculative multithreading models
• Combined TLS+HT+RA scheme
• Combined TLS+MP scheme
• Performance model
• Experimental setup and results
• Conclusions

17 Mixed Execution Model
• When there are idle resources:
  – Try MP on top of TLS!
• Map TLS threads on empty cores
• Map MP threads on empty contexts (same core)
• Minimal extra HW:
  – Branch confidence estimator
  – MP bit: thread is in MP mode
  – PATHS: how many outstanding branches
  – DIR: which path the thread followed

18 Combined TLS/MP Model
[Figure: Thread 1 and Thread 2 (speculative) over time]

19 Combined TLS/MP Model
[Figure: Thread 1 reaches a low-confidence branch; Thread 2 is speculative]
• Thread 1: MP: 0, PATHS: 000, DIR: 000

20 Combined TLS/MP Model
[Figure: multi-path mode; Thread 1 splits into Thread 1a and Thread 1b; Thread 2 is speculative]
• Thread 1a: MP: 1, PATHS: 001, DIR: 000
• Thread 1b: MP: 1, PATHS: 001, DIR: 001

21 Combined TLS/MP Model
[Figure: branch resolved; Thread 1a, Thread 1b and speculative Thread 2 over time]
• Thread 1a: MP: 1, PATHS: 001, DIR: 000
• Thread 1b: MP: 0, PATHS: 000, DIR: 000
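The MP, PATHS and DIR values shown across these slides can be reproduced with a small bookkeeping model. This is a sketch under assumptions: the slides give only the field values before the split, in multi-path mode, and after resolution. Treating PATHS as a counter, DIR as one bit per outstanding branch, and shifting DIR on resolution are assumptions made for the illustration, and the names `ThreadContext`, `fork_on_low_confidence` and `resolve_oldest_branch` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ThreadContext:
    name: str
    mp: int = 0          # MP bit: 1 while the thread is in multi-path mode
    paths: int = 0       # PATHS: number of outstanding low-confidence branches
    direction: int = 0   # DIR: one bit per outstanding branch (assumed layout)

    def show(self):
        return (f"Thread {self.name}: MP: {self.mp}, "
                f"PATHS: {self.paths:03b}, DIR: {self.direction:03b}")


def fork_on_low_confidence(parent):
    """Split a thread at a low-confidence branch into two path threads."""
    bit = 1 << parent.paths                      # DIR bit owned by this branch
    first = ThreadContext(parent.name + "a", 1, parent.paths + 1,
                          parent.direction)      # follows direction 0
    second = ThreadContext(parent.name + "b", 1, parent.paths + 1,
                           parent.direction | bit)  # follows direction 1
    return first, second


def resolve_oldest_branch(threads, outcome):
    """Discard threads on the wrong path of the oldest outstanding branch
    and update the survivors' bookkeeping."""
    survivors = []
    for t in threads:
        if (t.direction & 1) != outcome:
            continue                             # wrong path: discarded
        paths = t.paths - 1
        survivors.append(ThreadContext(t.name, 1 if paths > 0 else 0,
                                       paths, t.direction >> 1))
    return survivors


if __name__ == "__main__":
    t1 = ThreadContext("1")
    print(t1.show())
    t1a, t1b = fork_on_low_confidence(t1)
    print(t1a.show())
    print(t1b.show())
    for t in resolve_oldest_branch([t1a, t1b], outcome=1):
        print(t.show())
```

With a branch outcome of 1, Thread 1b is the survivor and returns to MP: 0, PATHS: 000, DIR: 000, matching the values shown on slide 21.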

22 Intricacies to be Handled
• How do we map TLS/MP threads?
  – Different mapping policies for TLS threads
• Dealing with thread ordering
  – Correct data forwarding
• Dealing with violations
  – While in “MP-Mode” delay restarts/kills/commits
  – No squashes on the wrong path
• Thread spawning:
  – Delayed as well, to keep contention low

23 Outline
• Introduction
• Speculative multithreading models
• Combined TLS+HT+RA scheme
• Combined TLS+MP scheme
• Performance model
• Experimental setup and results
• Conclusions

24 Understanding Performance Benefits
• Complex TLS thread interactions obscure performance benefits
• Even more true for mixed execution models
• We need a way to quantify ILP and TLP contributions to bottom-line performance
• Proposed model:
  – Able to break benefits down into ILP and TLP contributions

25 Performance Model
• $S_{all} = S_{seq} \times S_{ilp} \times S_{ovl}$
1. Compute overall speedup: $S_{all} = T_{seq} / T_{mt}$

26 Performance Model
• $S_{all} = S_{seq} \times S_{ilp} \times S_{ovl}$
1. Compute overall speedup ($S_{all}$)
2. Compute sequential TLS speedup: $S_{seq} = T_{seq} / T_{1p}$

27 Performance Model
• $S_{all} = S_{seq} \times S_{ilp} \times S_{ovl}$
1. Compute overall speedup ($S_{all}$)
2. Compute sequential TLS speedup ($S_{seq}$)
3. Compute speedup due to ILP: $S_{ilp} = (T_1 + T_2) / (T_1' + T_2')$

28 Performance Model
• $S_{all} = S_{seq} \times S_{ilp} \times S_{ovl}$
1. Compute overall speedup ($S_{all}$)
2. Compute sequential TLS speedup ($S_{seq}$)
3. Compute speedup due to ILP ($S_{ilp}$)
4. Use everything to compute TLP: $S_{ovl} = S_{all} / (S_{seq} \times S_{ilp})$
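The four steps of the model can be worked through numerically. The sketch below is an illustration, not the authors' tooling. The function name `decompose_speedup` and the timing numbers are made up, and the readings of the symbols are assumptions: T_seq is taken as the original sequential run time, T_1p as the TLS code run on one processor, T_mt as the multithreaded run, the unprimed per-thread times as measured in the one-processor run and the primed ones as measured in the multithreaded run.

```python
def decompose_speedup(t_seq, t_mt, t_1p, thread_times_1p, thread_times_mt):
    """Split the overall speedup into sequential, ILP and overlap (TLP)
    factors, following the four steps on the slides.

    thread_times_1p: per-thread times T1, T2, ... (assumed: one-processor run)
    thread_times_mt: per-thread times T1', T2', ... (assumed: multithreaded run)
    """
    s_all = t_seq / t_mt                                   # step 1
    s_seq = t_seq / t_1p                                   # step 2
    s_ilp = sum(thread_times_1p) / sum(thread_times_mt)    # step 3
    s_ovl = s_all / (s_seq * s_ilp)                        # step 4
    return {"s_all": s_all, "s_seq": s_seq, "s_ilp": s_ilp, "s_ovl": s_ovl}


if __name__ == "__main__":
    # Hypothetical cycle counts, chosen only to exercise the formulas.
    result = decompose_speedup(
        t_seq=100.0,                    # original sequential code
        t_mt=50.0,                      # multithreaded execution
        t_1p=110.0,                     # TLS code on a single processor
        thread_times_1p=[60.0, 50.0],   # T1, T2
        thread_times_mt=[50.0, 38.0],   # T1', T2'
    )
    for name, value in result.items():
        print(f"{name} = {value:.3f}")
```

Because the overlap factor is defined as the remainder, the three factors always multiply back to the overall speedup. A sequential factor below 1 in this example would correspond to overhead added to the single-processor TLS run.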

29 Outline
• Introduction
• Speculative multithreading models
• Combined TLS+HT+RA scheme
• Combined TLS+MP scheme
• Performance model
• Experimental setup and results
• Conclusions

30 Experimental Setup
• Simulator, compiler and benchmarks:
  – SESC (http://sesc.sourceforge.net/)
  – POSH (Liu et al., PPoPP ’06)
  – SPEC 2000 Int.
• Architecture (for TLS+HT+RA scheme):
  – Four-way CMP, 4-issue cores
  – 16KB L1 data (multi-versioned) and instruction caches
  – 1MB unified L2 caches
  – Inst. window/ROB: 80/104 entries
  – 16KB last value predictor

31 Experimental Setup
• Simulator, compiler and benchmarks:
  – SESC (http://sesc.sourceforge.net/)
  – POSH (Liu et al., PPoPP ’06)
  – SPEC 2000 Int.
• Architecture (for TLS+MP scheme):
  – Four-way CMP, 4-issue cores, 6 contexts/core
  – 32K-bit OGEHL, 1KByte BTB, 32-entry RAS
  – 8 Kbit enhanced JRS confidence estimator
  – 32KB L1 data (multi-versioned) and instruction caches
  – 1MB unified L2 caches

32 Results I: TLS + HT + RA

33 Comparing TLS, RunAhead and Unified Scheme

34 Comparing TLS, RunAhead and Unified Scheme
• Almost additive benefits

35 Comparing TLS, RunAhead and Unified Scheme
• Almost additive benefits
• 10.2% over TLS, 18.3% over RA

36 Understanding the Extra ILP
• Improvements of ILP come from:
  – Mainly memory
  – Branch prediction (improvement 0.5%)
• Focus on memory:
  – Miss rate on committed path
  – Clustering of misses (different cost)

37 Normalized Shared Cache Misses
• All schemes better than sequential
• Unified 41% better than sequential

38 Isolated vs. Clustered Misses
• Both TLS + RA
• Large window machines
• Unified does even better

39 Results II: TLS + MP

40 Impact of Branch Prediction on TLS
• TLS emulates wider processor:
  – Removing mispredictions important (Amdahl)

41 Branch Entropy for TLS
• Much harder for TLS:
  – History partitioning
  – History re-order

42 Increasing the Size of the Branch Predictor
• Aliasing not much of a problem
• Fundamental limitation is lack of history

43 Designing a Better Predictor
• Predictors that exploit longer histories are not necessarily better

44 Comparing TLS, MP and Combined TLS/MP

45 Comparing TLS, MP and Combined TLS/MP
• Additive benefits; no point in doubling the predictor

46 Comparing TLS, MP and Combined TLS/MP
• Additive benefits; no point in doubling the predictor
• 9.2% over TLS, 28.2% over MP

47 Pipeline Flushes
• Significant amount of flush reductions
• More than base MP!

48 Outline
• Introduction
• Speculative multithreading models
• Combined TLS+HT+RA scheme
• Combined TLS+MP scheme
• Performance model
• Experimental setup and results
• Conclusions

49 Also in the ICS’09 paper …
• Dealing with the load of the system
• Converting TLS threads to HT
• Multiple HT
• Effect of a better VP
• Detailed comparison of performance model against existing models (Renau et al., ICS ’05)

50 Also in the HPCA’10 paper …
• Detailed HW description
• Impact of scheduling
• Limiting MP to DP
• Effect of scaling
• Effect of a better CE

51 Conclusions
• CMPs are here to stay:
  – What about single-threaded apps and apps with significant sequential sections?
  – We advocate the use of speculative multithreading
• Different apps require different SM techniques
  – Even within apps, different phases
• We propose the first mixed execution model
  – TLS is nicely complemented by HT, RA, and MP
• Combined approaches outperform existing SM models:
  – TLS+HT+RA: TLS by 10.2% avg. (up to 41.2%) and RA by 18.3% avg. (up to 35.2%)
  – TLS+MP: TLS by 9.2% avg. (up to 23.2%) and MP by 28.2% avg. (up to 138%)

52 Mixed Speculative Multithreaded Execution Models Marcelo Cintra University of Edinburgh

53 Backup Slides

54 Effect of Prefetching
• Our HTs do a better job than an aggressive prefetcher!
• Prefetching helps our scheme as much as it helps base TLS

55 System Utilization of Base TLS
• 90% of the time TLS uses 1 or 2 cores

56 Hardware Cost
• Last value predictor: 16KB
  – Can be made smaller, although it helps a lot
• Confidence estimator: 2Kb
  – Helps mainly on data-dependent branches
• Extra bit in thread context information: 1 bit/thread

57 Prediction Stats
[Table: columns Stat. (%), Bzip2, Crafty, Gap, Gzip, Mcf, Parser, Twolf, Vortex, Vpr, Avg.; rows Misp, PVN, PVP, SPEC, SENS]
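The slide does not define the row labels. Assuming they follow the usual confidence-estimation metrics (sensitivity, specificity, predictive value of a positive test, predictive value of a negative test), they can be computed from four counts as sketched below. The function name `confidence_metrics` and the sample counts are hypothetical, and whether the talk uses exactly these definitions is an assumption.

```python
def confidence_metrics(correct_high, correct_low, incorrect_high, incorrect_low):
    """Standard confidence-estimator metrics from a 2x2 outcome table.

    correct_high:   correct predictions marked high confidence
    correct_low:    correct predictions marked low confidence
    incorrect_high: incorrect predictions marked high confidence
    incorrect_low:  incorrect predictions marked low confidence
    """
    total = correct_high + correct_low + incorrect_high + incorrect_low
    return {
        # Fraction of all predictions that were wrong.
        "Misp": (incorrect_high + incorrect_low) / total,
        # Fraction of correct predictions flagged as high confidence.
        "SENS": correct_high / (correct_high + correct_low),
        # Fraction of incorrect predictions flagged as low confidence.
        "SPEC": incorrect_low / (incorrect_low + incorrect_high),
        # Probability that a high-confidence prediction is correct.
        "PVP": correct_high / (correct_high + incorrect_high),
        # Probability that a low-confidence prediction is incorrect.
        "PVN": incorrect_low / (incorrect_low + correct_low),
    }


if __name__ == "__main__":
    # Made-up counts, used only to exercise the formulas.
    stats = confidence_metrics(correct_high=80, correct_low=10,
                               incorrect_high=2, incorrect_low=8)
    for name, value in stats.items():
        print(f"{name}: {100 * value:.1f}%")
```

For a multi-path scheme, PVN indicates how often forking at a low-confidence branch is actually useful, while SPEC indicates how many mispredictions the estimator catches.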

