1 -1- UC San Diego / VLSI CAD Laboratory Optimal Reliability-Constrained Overdrive Frequency Selection in Multicore Systems Andrew B. Kahng and Siddhartha Nath VLSI CAD LABORATORY, UC San Diego

2 -2- Outline: Motivation; Previous Work; Our Work; Problem Formulation; Optimal (Discretized) Solution Flow; Results; Conclusions

3 -3- Reliability in Multicore Systems
Modern multicore processors operate in multiple operating modes
  – E.g., nominal, supply voltage scaling, turbo, etc.
Reliability is a key processor design consideration at leading-edge technology nodes to guarantee a prescribed system lifetime
Task scheduling affects how cores are used
  – A subset of cores can fail before others

4 -4- Scheduling in Multicore Systems
Scheduler packs tasks using some or all of the available processing cores
[Figure: number of active cores (#Cores) over time for Application A and Application B]

5 -5- Core Wearout
Mean time to failure (MTTF) is a measure of the lifetime of a core
Reliability mechanisms degrade the MTTF of a core
  – E.g., electromigration (EM), stress migration, hot carrier injection, bias temperature instability, etc.
When not all cores are simultaneously active
  – Adjust task scheduling on a subset of active cores for balanced wearout

6 -6- Impact of Overdrive Frequency
Overdrive frequency: the frequency that results from overclocking the cores to meet performance and throughput requirements
Overdrive frequencies cause faster MTTF degradation
Two challenges
  – Can violate “acceptable throughput” for tasks: cores fail before all assigned tasks are completed
  – Can violate minimum “acceptable performance” for tasks: cores operate at lower frequencies

7 -7- Terminology

8 -8- Outline: Motivation; Previous Work; Our Work; Problem Formulation; Optimal (Discretized) Solution Flow; Results; Conclusions

9 -9- Classification of Existing Works
Work | Type
Reiss12 | NRC, NLG, NPG
Karpuzcu09 | RC, NLG, NPG
Mihic04 | RC, LG (Dynamic power management), NPG
Rosing07 | RC, LG (Dynamic power management), NPG
Rong06 | RC, LG (Dynamic power management), NPG
Coskun09 | RC, LG (Dynamic thermal management), NPG
Srinivasan04 | RC, LG (Dynamic reliability management), NPG
Karl08 | RC, LG (Dynamic reliability management), NPG
(N)RC – (Non-) Reliability Constrained
(N)LG – (No) Lifetime Guarantee
(N)PG – (No) Performance Guarantee

10 -10- Counterexample to NRC Policies
Task schedule
Max frequency = 3GHz
Min acceptable frequency = 1.8GHz
Initial lifetime = 7 years (61320h)
#Active cores (m) | Nominal execution time (AF = 1) | Overdrive execution time (AF = 9.77)
1 | 1000h | 3000h
2 | 2000h | 5000h
3 | 3000h | 8000h
4 | 2000h | 5000h
All cores always operate at 3GHz
  – From HotSpot simulations, AF = 9.77
Lifetime after nominal tasks requiring m = 3 is 24947.5h
  – Tasks requiring m = 3 cannot complete overdrive execution
  – Tasks requiring m = 4 cannot complete at all
Cannot guarantee “acceptable throughput”!!!
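
The lifetime bookkeeping behind this counterexample can be sketched as follows. This is a minimal illustration, not the authors' model: it assumes four cores (`NUM_CORES`), perfect rotation so that a task needing m cores charges each core m/4 of its duration, and wear equal to active hours multiplied by the acceleration factor (AF). The function name `first_unfinished_phase` is hypothetical. The exact remaining-lifetime figure on the slide depends on modelling details that the transcript does not preserve, so the sketch only shows how an always-overdriven schedule can exhaust the lifetime budget.

```python
# Sketch only: the wear model below is an assumption, not the authors' method.
INITIAL_LIFETIME_H = 61320.0   # 7 years, from the slide
AF_NOMINAL = 1.0               # acceleration factor in nominal mode (slide)
AF_OVERDRIVE = 9.77            # acceleration factor at 3GHz (slide)
NUM_CORES = 4                  # assumed: the slide only lists m = 1..4

# (m, nominal hours, overdrive hours), copied from the slide's task schedule
SCHEDULE = [
    (1, 1000.0, 3000.0),
    (2, 2000.0, 5000.0),
    (3, 3000.0, 8000.0),
    (4, 2000.0, 5000.0),
]


def first_unfinished_phase(schedule, lifetime_h, num_cores):
    """Return (m, mode) for the first phase whose wear exceeds the remaining
    per-core lifetime, or None if every phase fits."""
    remaining = lifetime_h
    for m, nominal_h, overdrive_h in schedule:
        share = m / num_cores  # assumed: load rotates evenly over all cores
        phases = (
            ("nominal", nominal_h, AF_NOMINAL),
            ("overdrive", overdrive_h, AF_OVERDRIVE),
        )
        for mode, hours, af in phases:
            wear = share * hours * af  # lifetime consumed per core
            if wear > remaining:
                return m, mode
            remaining -= wear
    return None


print(first_unfinished_phase(SCHEDULE, INITIAL_LIFETIME_H, NUM_CORES))
```

Under these assumptions the first phase that does not fit is the overdrive execution of the m = 3 tasks, which agrees with the slide's qualitative conclusion.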

11 -11- Counterexample to RC-LG Policies
Task schedule
Max frequency = 3GHz
Min acceptable frequency = 1.8GHz
Initial lifetime = 61320h
#Active cores (m) | Nominal execution time (AF = 1) | Overdrive execution time (AF = 9.77)
1 | 1000h | 3000h
2 | 2000h | 5000h
3 | 3000h | 8000h
4 | 2000h | 5000h
All cores operate initially at 3GHz, and then at 1.6GHz
  – From HotSpot simulations, AF = 9.77
All tasks are completed but
  – Tasks requiring m = 3, 4 operate at 1.6GHz < 1.8GHz (acceptable performance) !!!
Cannot guarantee “acceptable performance”!!!

12 -12- Outline: Motivation; Previous Work; Our Work; Problem Formulation; Optimal (Discretized) Solution Flow; Results; Conclusions

13 -13- What Do We Do Differently?
We formulate a new (offline) optimization problem: Maximum-Value Reliability-Constrained Overdrive Frequencies (MVRCOF)
Important because
  – Overdrive frequencies are our optimization variables
  – User experience is the value
We guarantee prescribed levels of “acceptable performance” and “acceptable throughput”

14 -14- Comparison of Ours vs. Existing Works
Work | Type
Reiss12 | NRC, NLG, NPG
Karpuzcu09 | RC, NLG, NPG
Mihic04 | RC, LG (Dynamic power management), NPG
Rosing07 | RC, LG (Dynamic power management), NPG
Rong06 | RC, LG (Dynamic power management), NPG
Coskun09 | RC, LG (Dynamic thermal management), NPG
Srinivasan04 | RC, LG (Dynamic reliability management), NPG
Karl08 | RC, LG (Dynamic reliability management), NPG
Our Work | RC, LG (Dynamic reliability management), PG
(N)RC – (Non-) Reliability Constrained
(N)LG – (No) Lifetime Guarantee
(N)PG – (No) Performance Guarantee

15 -15- What is the Optimal Solution?
Task schedule
Max frequency = 3GHz
Min acceptable frequency = 1.8GHz
Initial lifetime = 61320h
#Active cores (m) | Nominal execution time (AF = 1) | Overdrive execution time (AF = 9.77)
1 | 1000h | 3000h
2 | 2000h | 5000h
3 | 3000h | 8000h
4 | 2000h | 5000h
Optimal (discretized) solution from exhaustive search
#Active cores (m) | Nominal frequency | Overdrive frequency
1 | 1.5GHz | 2.85GHz
2 | 1.5GHz | 2.3GHz
3 | 1.5GHz | 1.8GHz
4 | 1.5GHz | 1.8GHz
We guarantee both “acceptable performance” and “acceptable throughput” if a solution exists!!!

16 -16- Our Key Contributions
We develop a new MVRCOF formulation to maximize the value of operating multiple cores at overdrive frequencies
Our solutions provide guarantees for prescribed lower bounds on “acceptable performance” and “acceptable throughput”
We propose an optimal (discretized) solution using exhaustive search, as well as an approximate heuristic flow
Our solutions determine optimal overdrive frequencies as well as execution times for each active core
We empirically determine that our optimal solutions improve the objective function value by up to 17.4% versus existing works

17 -17- Outline: Motivation; Previous Work; Our Work; Problem Formulation; Optimal (Discretized) Solution Flow; Results; Conclusions

18 -18- Formulation

19 -19- Formulation In English

20 -20- Formulation In English
Guarantees “acceptable throughput”, i.e., all tasks complete within lifetime and cores wear out in a balanced manner
Upper bound on instantaneous power dissipated by any core
Upper bound on instantaneous temperature of all active cores

21 -21- MVRCOF Inputs: Task Description
[Diagram: App 1, App 2, ..., App X feed the Scheduler, which supplies the inputs below]
E_{l,m}: execution times in nominal and overdrive modes with different numbers of active cores
w_{l,m}: weights in nominal and overdrive modes with different numbers of active cores
f_{nom,m}: nominal frequencies at different numbers of active cores

22 -22- MVRCOF Inputs: System Description
[Diagram: the SoC Designer supplies the inputs below]
N: number of available symmetric cores
P_max: maximum power of any core
f_max: maximum frequency of any core
T_max: maximum die temperature
T_nom: nominal temperature
MTTF: initial MTTF of any core

23 -23- MVRCOF Outputs
[Diagram: the MVRCOF solver produces f_{OD,m}, v_{j,m,l} and u_{i,l}]
Optimal overdrive frequencies for each set of active cores
%lifetime each core operates at nominal and overdrive modes

24 -24- MVRCOF Inputs and Outputs
Task Description (App 1, App 2, ..., App X; Scheduler): E_{l,m}, w_{l,m}, f_{nom,m}
System Description (SoC Designer): N, P_max, f_max, T_max, T_nom, MTTF
MVRCOF solver
Outputs: f_{OD,m}, v_{j,m,l}, u_{i,l}

25 -25- Outline: Motivation; Previous Work; Our Work; Problem Formulation; Optimal (Discretized) Solution Flow; Results; Conclusions

26 -26- Optimal (Discretized) Solution Flow

27 -27- Heuristic Flow

28 -28- Outline: Motivation; Previous Work; Our Work; Problem Statement; Optimal (Discretized) Solution Flow; Results; Conclusions

29 -29- Experimental Setup
Each core is simulated with 72 copies of jpeg_encoder from OpenCores
  – SP&R implementation with commercial tools and foundry 45nm libraries
Power simulation using Synopsys PrimeTime-PX
  – Increase voltage from 0.8V to 1.2V in steps of 10mV
  – Increase frequency from 1.5GHz to 3GHz in steps of 50MHz
Thermal simulation using HotSpot
LP solver is lp_solve
Baseline policy is RC-LG from existing works
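
To give a feel for the size of the discretized search space implied by these sweeps, the grids can be enumerated as below. This is a rough sketch under stated assumptions: the helper name `candidate_assignments` is hypothetical, one overdrive frequency is assumed per active-core count m = 1..4, and the 1.8GHz floor is borrowed from the counterexample slides. The actual flow evaluates each candidate with an LP, which is not modelled here.

```python
from itertools import product

# Sweep ranges from the experimental setup, kept as integers to avoid
# floating-point drift: frequency in MHz, voltage in mV.
FREQS_MHZ = range(1500, 3001, 50)   # 1.5GHz to 3GHz in 50MHz steps
VOLTS_MV = range(800, 1201, 10)     # 0.8V to 1.2V in 10mV steps

MIN_ACCEPTABLE_MHZ = 1800   # assumed floor, taken from the counterexample slides
NUM_CORE_COUNTS = 4         # assumed: one overdrive frequency per m = 1..4


def candidate_assignments(freqs, floor_mhz, num_core_counts):
    """Iterate over every tuple of overdrive frequencies (one per m) whose
    entries are all at or above the acceptable-performance floor."""
    allowed = [f for f in freqs if f >= floor_mhz]
    return product(allowed, repeat=num_core_counts)


num_freqs = len(FREQS_MHZ)
num_volts = len(VOLTS_MV)
num_candidates = sum(
    1 for _ in candidate_assignments(FREQS_MHZ, MIN_ACCEPTABLE_MHZ, NUM_CORE_COUNTS)
)
print(num_freqs, num_volts, num_candidates)
```

Working in integer MHz and mV keeps the grid points exact; the reported optimal overdrive frequencies (2.85GHz, 2.3GHz, 1.8GHz, 1.8GHz) all fall on this 50MHz grid.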

30 -30- Testcases
Name: 4-I
[Table: testcase parameters, with some columns in Kh; surviving values: 1, 2, 3, 4 | 1, 2, 3, 2 | 3, 5, 8, 5 | 0.5, 0.3, 0.2, 0.4 | 0.5, 0.7, 0.8, 0.6]

31 -31- Optimal, Heuristic vs. RC-LG
[Figure: comparison plot; surviving annotations: -12%, -9%]

32 -32- Runtime Comparison

33 -33- Outline: Motivation; Previous Work; Our Work; Problem Statement; Optimal (Discretized) Solution Flow; Results; Conclusions

34 -34- Conclusions
We formulate and solve a new MVRCOF problem under lifetime reliability constraints
We develop an MVRCOF solver that implements our optimal (discretized) and heuristic flows
Our optimal solutions guarantee both “acceptable performance” and “acceptable throughput”
We empirically demonstrate that our optimal solutions achieve up to 17.4% greater value of the objective function than existing works
Our future work includes
  – Application of our methods to traces from actual server workloads
  – Expansion of our methods to handle other objectives
  – Solutions that are temperature-history-aware

35 -35- Thank You!

36 -36- Back up

37 -37- Notation

38 -38- Optimal Solution Flow
[Flow diagram: f_{OD,m}; power simulation giving Power(f_{OD,m}); thermal simulation giving core temperature per (m, j); LUT of (f_{OD,m}, temp, AF); exhaustive search for each core i, f_{OD,m} and combination j of m, with an LP; output is the optimal objective function value, f_{OD,m} and t_{j,m,l}]

