Towards Power Efficiency on Task-Based, Decoupled Access-Execute Models Konstantinos Koukos David Black-Schaffer Vasileios Spiliopoulos Stefanos Kaxiras.

Presentation transcript:

1 Towards Power Efficiency on Task-Based, Decoupled Access-Execute Models Konstantinos Koukos David Black-Schaffer Vasileios Spiliopoulos Stefanos Kaxiras Institute for Information Technology Uppsala University

2 Decoupling for Energy Efficiency
Department of Information Technology (Institutionen för informationsteknologi) | www.it.uu.se
Motivation: the range of voltage scaling is shrinking
- We can no longer rely on DVFS to provide a quadratic energy decrease for an at most linear performance decrease
Goal: minimize performance degradation
- Ideally: change DVFS per instruction
  - Low frequency when waiting for memory
  - Max frequency when computing
- Impractical: requires instantaneous DVFS on cache misses
Our solution: split the program into Access (prefetch) and Execute (compute)
- DAE (Decoupled Access-Execute)

3 DAE for Energy Efficiency
Our solution: split the program into Access (prefetch) and Execute (compute)
- DAE (Decoupled Access-Execute)
Access:
- Prefetch data into lower-level cache
- Eliminate most LLC and TLB misses
- Memory waiting happens in the access phase
- Run at low f to maximize energy savings
Execute:
- Computation and stores
- If access is successful, there are not many stalls in execute
- Run at high f to maximize performance

4 How do we do it?
Parallel workloads
- Use all cores to do useful work
Task-based programming model. Why?
- Schedule tasks independently
- Control task size
- Easy to convert to DAE!
How? Split each task into an access phase and an execute phase
Access phase:
- Remove stores
- Remove arithmetic computation
- Keep (replicate) address calculation
- Turn loads into prefetches
Execute phase: the original (unmodified) task
- Scheduled to run on the same core right after Access
DVFS Access and Execute independently: Access at low f, Execute at high f
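The four access-phase rules on the slide above can be illustrated with a toy sketch. This is not the authors' tool: the `Op` record, the operation kinds, and `make_access_phase` are hypothetical names invented here, and the sketch assumes address calculations do not depend on loaded values (a task with pointer chasing would need some real loads in its access phase).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """One operation of a task in a hypothetical mini-IR."""
    kind: str  # "addr", "load", "arith", "store", or "prefetch"
    text: str  # human-readable form of the operation


def make_access_phase(task):
    """Derive an access phase from a task using the slide's four rules."""
    access = []
    for op in task:
        if op.kind == "addr":
            # Keep (replicate) address calculation.
            access.append(op)
        elif op.kind == "load":
            # Turn loads into prefetches.
            access.append(Op("prefetch", op.text))
        # Stores and arithmetic computation are removed.
    return access


# Hypothetical task body: c[i] = a[i] * b[i]
task = [
    Op("addr", "i = base + k"),
    Op("load", "a[i]"),
    Op("load", "b[i]"),
    Op("arith", "t = a[i] * b[i]"),
    Op("store", "c[i] = t"),
]

access_phase = make_access_phase(task)
execute_phase = list(task)  # Execute phase: the original, unmodified task.

for op in access_phase:
    print(op.kind, op.text)
```

In a real runtime the access phase would be scheduled on the same core immediately before the execute phase, so that the prefetched data is still cached when the computation starts.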

5 Limitations
Current processors:
- Do not yet support low-latency, per-core DVFS
- Our task-based runtime supports per-task-phase DVFS, but doing it for real is impractical for small tasks (working-set size ~ L2)
Modeling DAE for future processors:
- Per-core DVFS (on-chip voltage regulator)
- Reduce DVFS overhead 50×
How do we do this?
[Diagram: Run-time Statistics (Time, IPC), Power Model, Estimate Access/Execute Energy, Verify Accuracy]

6 Estimate per-phase energy
Power: $P = f \times V^2 \times A \times C$
- $A \times C$ is the effective capacitance $C_{\mathrm{eff}}$, measured as a function of IPC
Profiling for each $f$ and $V$ gives $\mathrm{IPC}_A$, $\mathrm{IPC}_X$, $\mathrm{Time}_A$, $\mathrm{Time}_X$
$E_A = f_{\min} \times V_{\min}^2 \times C_{\mathrm{eff}}(\mathrm{IPC}_A) \times \mathrm{Time}_A$
$E_X = f_{\max} \times V_{\max}^2 \times C_{\mathrm{eff}}(\mathrm{IPC}_X) \times \mathrm{Time}_X$
Now we can model per-core instantaneous DVFS
[Plot: Effective Capacitance vs. IPC]
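As a rough illustration of the per-phase model on this slide, the sketch below evaluates E = f × V² × C_eff(IPC) × Time for one access phase and one execute phase. Everything numeric is an assumption made up for the example: the slide says only that C_eff is measured as a function of IPC, so the linear fit in `c_eff_from_ipc`, its coefficients, and the frequency, voltage, IPC, and time values are placeholders, not measurements from the talk.

```python
def phase_energy(f_hz, v_volts, c_eff_farads, time_s):
    """Energy of one phase: E = f * V^2 * C_eff * Time."""
    return f_hz * v_volts ** 2 * c_eff_farads * time_s


def c_eff_from_ipc(ipc, slope=0.5e-9, intercept=1.0e-9):
    """Hypothetical linear fit of effective capacitance (farads) to IPC.

    The real relationship would come from profiling; these coefficients
    are invented for illustration only.
    """
    return intercept + slope * ipc


# Assumed operating points (not from the slides).
F_MIN, V_MIN = 1.6e9, 0.9  # access phase runs at the lowest f, V
F_MAX, V_MAX = 3.2e9, 1.2  # execute phase runs at the highest f, V

# Assumed profiled values for one task.
ipc_a, time_a = 0.4, 0.2  # access phase: memory bound, low IPC
ipc_x, time_x = 1.8, 0.5  # execute phase: compute bound, high IPC

e_access = phase_energy(F_MIN, V_MIN, c_eff_from_ipc(ipc_a), time_a)
e_execute = phase_energy(F_MAX, V_MAX, c_eff_from_ipc(ipc_x), time_x)
e_total = e_access + e_execute

print(f"E_A = {e_access:.3f} J, E_X = {e_execute:.3f} J, total = {e_total:.3f} J")
```

Because each phase is evaluated with its own f and V, the same two functions can be reused to model any per-phase DVFS setting once the profiled IPC and time for that setting are known.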

7 Understanding DAE
[Figure: Time (sec) and Energy (Joule) for Coupled vs. Decoupled execution. Coupled: f_max to f_min. Decoupled: Access phase at f_min, Execute phase at f_max to f_min.]

8 Understanding DAE
[Figure: Time (sec) and Energy (Joule) for Coupled vs. Decoupled execution]
- Performance is unaffected
- Energy is reduced by 25%

9 Three Experiments
Coupled at optimal-EDP f, V
Decoupled Naïve
- Access at f_min
- Execute at f_max
Decoupled Optimal EDP
- Access at the optimal f_opt (for Access)
- Execute at the optimal f_opt (for Execute)
All results are normalized to Coupled at f_max
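The experiments above compare configurations by EDP, the energy-delay product (energy multiplied by execution time), normalized to the coupled run at f_max. A minimal sketch of how an optimal-EDP operating point could be picked from profiled data follows; the profile values are invented for illustration, and `optimal_edp_point` is a hypothetical helper, not part of the authors' runtime.

```python
def edp(energy_j, time_s):
    """Energy-delay product: lower is better."""
    return energy_j * time_s


def optimal_edp_point(profile):
    """Return the (freq_ghz, time_s, energy_j) entry with the lowest EDP."""
    return min(profile, key=lambda point: edp(point[2], point[1]))


# Hypothetical coupled-execution profile: (frequency in GHz, time in s, energy in J).
coupled_profile = [
    (1.6, 1.40, 6.0),
    (2.4, 1.12, 7.0),
    (3.2, 1.00, 10.0),
]

baseline = coupled_profile[-1]  # Coupled at f_max, the normalization reference.
best = optimal_edp_point(coupled_profile)

normalized_time = best[1] / baseline[1]
normalized_edp = edp(best[2], best[1]) / edp(baseline[2], baseline[1])

print(f"optimal-EDP frequency: {best[0]} GHz")
print(f"normalized time: {normalized_time:.3f}, normalized EDP: {normalized_edp:.3f}")
```

For the Decoupled Optimal EDP experiment, the same selection would be run twice, once over the access-phase profile and once over the execute-phase profile, since each phase may have a different optimal frequency.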

10 Evaluation: Coupled at Optimal EDP
[Figure: Normalized Time and Normalized EDP per benchmark, with geometric mean (G.Mean)]
- Good EDP, bad performance
- Overall slowdown: ≈ 12%
- EDP improvement: ≈ 22%

11 Evaluation: Decoupled Naïve
[Figure: Normalized Time and Normalized EDP per benchmark, with geometric mean (G.Mean)]
- Better EDP, no slowdown
- DAE can improve EDP over coupled execution

12 Evaluation: Decoupled Opt. EDP
[Figure: Normalized Time and Normalized EDP per benchmark, with geometric mean (G.Mean)]
- Even better EDP, with a performance decrease
- DAE Opt. EDP can further improve EDP at the cost of a slight performance impact

13 Conclusions
- Separating execute and access enables optimal DVFS
- Deliver f_max performance at optimal EDP

14 Questions
Thank you!

