Presentation is loading. Please wait.

Presentation is loading. Please wait.

Dept. of Electrical & Computer Engineering Self-Morphing Cores for Higher Power Efficiency and Improved Resilience Nithesh Kurella, Sudarshan Srinivasan.

Similar presentations


Presentation on theme: "Dept. of Electrical & Computer Engineering Self-Morphing Cores for Higher Power Efficiency and Improved Resilience Nithesh Kurella, Sudarshan Srinivasan."— Presentation transcript:

1 Dept. of Electrical & Computer Engineering Self-Morphing Cores for Higher Power Efficiency and Improved Resilience Nithesh Kurella, Sudarshan Srinivasan Israel Koren, Sandip Kundu Dept. Of Electrical & Computer Engineering University of Massachusetts at Amherst

2 2 Outline  Asymmetric Multicores – benefits and limitations  Challenge: Match the capabilities of the core to current computational needs of applications  Approach: Self-morphing core that can adapt faster to application demands  Main goal: Higher power efficiency with limited or no decrease in performance  Secondary goal: Improve resilience to soft errors

3 3 Asymmetric Multicore Processors (AMPs)  Cores of different computing capabilities  Different performance and power characteristics  Typically consists of Out-of-order (OOO) cores (High performance core) In-Order (InO) cores (Low performance & Energy efficient core) Core 1 Core 2 Asymmetric multicore OOO InOrder

4 4 Commercial ARM Big/Little Architecture  Use the right processor for the right task

5 5 Limitations of current AMP Architectures (1) Limited architectural flexibility Limited choices of core capabilities (2) Fixed number of large and small cores (2) Limited thread to core mapping flexibility Applications have phases (time periods) with different computational requirements Swapping threads between cores can reduce the power consumed, but Task migration has a high overhead ( need to transfer thread state/data)  Thread migration at granularity of millions of instructions (coarse grain) Core 1 Core 2 Thread swapping L1 cache L2 cache Thread 1 Thread 2

6 6 Can we benefit from more than two core types? Performance/Watt of 10 benchmarks

7 7 Applications can benefit from fine-grain assignment ~1000s of instructions [Lukefahr et al. Micro’12] Temporal (fine-grain) variations for sjeng

8 8 Can we exploit Fine-Grain Changes in demands? Develop a fine grain application to core assignment to improve power efficiency without high migration overhead  A self-morphing core: can morph into two (or more) architecture types (core modes) with varying execution width and resource sizes Significantly lower thread migration overhead Critical units (register file, caches and branch predictor) are common to all core modes

9 9 Morphable Architectures  A Morphable architecture where OOO core turns into InO was proposed by [Lukefahr et al., Micro 2012]  InO has much lower power consumption, but Turning OOO core into InO in run time involves significant micro-architecture changes Higher design cost and verification  Questions to be investigated: 1. Are two architecture modes (core types) sufficient to match the large variance in application needs? 2. Is an InO mode necessary, as its inclusion complicates the design?

10 10 Designing a Self-Morphing Core  Goal: Design a core than can morph into various OOO modes with varying execution width and resource sizes  Questions: How many core modes should we have? What should be the architectural parameters of these modes? How fine-grained should mode switches be? Under what conditions should we switch from one mode to another?

11 11 Core Design Space Exploration  Find core types that would provide best performance/Watt at fine-grain instruction granularities  About 11,000 design combinations - pruned to 300  Pruning: group structures to be resized concurrently rather than independent structure resizing

12 12 Number and Types of Cores (out of 300)  Objective: achieve the highest IPS 2 /Watt by allowing switching between core types at ~2K instruction granularity IPS 2 /Watt is used instead of IPS/Watt to emphasize performance  Best core configuration selected from 300 candidates for each 2K instruction interval based on IPS 2 /Watt  Need to to determine IPS 2 /Watt improvement threshold (for a mode switch) A threshold of  20% yields 10 core types with only 15% improvement A higher threshold reduced the number to 4 with a higher IPS 2 /Watt improvement  Fixed number of core types to 4

13 13 Core Types obtained 128 1.6 Frequency and ROB size analysis for IPS 2 /watt for the Average (AC) core Avg. Power consumed

14 14 How to decide on a mode switch?  Switching decision between modes is based on IPS 2 /Watt  To compute IPS 2 /Watt, we need to estimate performance and power Hardware performance counters (PMCs) are used to estimate performance and power at fine-grain granularity  Need to estimate power and performance on the currently active mode as well as 3 other core modes

15 15 Power & IPC Estimation based on PMCs 1.Identified 14 counters that are correlated with performance & power 2. Selected the smallest number of counters that have the highest possible correlation coefficient R 2 4.Regression analysis power of a given core type = f(chosen counters) IPC of a given core type = f(chosen counters) Explored HPCs Stalls (S) # Fetched instructions (F) # Branch mispredictions (BMP) L1 hit (L1h) L1 miss (L1 miss) L2 hit (L2h) L2 miss (L2m) TLB miss (TLB m) # retired INT instructions (INT) # retired FP instructions (FP) # retired Ld instructions (Ld) # retired St instructions (St) # retired Branch instructions (Br) IPC speculative Hit/Miss Retired Explored PMCs

16 16 Selection of PMCs PMC AC => Power NC, denotes using the counters of the average core to estimate the power on the narrow core

17 17 Obtained Power and IPC expressions

18 18 Average Estimation Error using PMCs PMC AC => Power/IPC denotes the average error in estimating power and IPC for the 3 other core types using the PMCs of the average core (AC) Maximum average error of 16%; adequate accuracy

19 19 Frequency of core mode switching  Power and performance are monitored every “window” of instructions (size to be determined)  Want to identify stable phase and prevent frequent morphing during a transient behavior  Wait for several windows (“history_depth”) and follow the most frequent recommendation  Morphing decision based on behavior during the last n retired instructions, where n=Window_Size x History_Depth

20 20 Frequency of morphing decisions  Window_size=500 and history_depth=4 combination yields the highest IPS 2 /Watt Decision to reconfigure (or not) is taken every 2K instructions

21 21 Overhead for Morphing: DVFS  Morphing from one core mode to another involves frequency change  Traditional DVFS is applied at coarse grain granularity (millions of cycles) due to its high overhead  Use of On-Chip regulator reduces the time needed for scaling voltage to hundreds of cycles  The selected fine-grain DVFS technique (upon mode switch) has an overhead of 200 cycles

22 22 Overall morphing overhead  In-core morphing retains processor state and cache content register file, caches and branch predictor are shared  Still additional cycles are needed upon reconfiguration Powering off/on fetch, decode units Power gating banks of ROB, RAT and LSQ units (10 clock cycles to power off one bank) Pipeline drain on every core mode switch The fine grain DVFS overhead is 200 cycles  On Average morphing overhead takes 500 cycles. Exact overhead determined in run-time depending on mode we morph into

23 23 Results and Analysis  Gem5 simulator; McPAT to measure power  We evaluate our proposed scheme with SPEC2006 and SPEC2000 benchmark suites  Benchmarks ran for 2 billion with skipping the first 2 billion instructions

24 24 Time spent in core modes Each of the 4 modes is useful (>40%) for several benchmarks

25 25 Improvement in IPS 2 /Watt Average IPS 2 /Watt improvement of 36% vs. AC baseline

26 26 The self-morphing scheme provides 33% energy savings compared to the AC core Unconstrained Energy savings of self-morphing

27 27 IPC of switching schemes (with AC as baseline) Unconstrained

28 28 Impact on IPC of morphing scheme overhead Morphing overhead increase from 500 to 1000 cycles reduces the average performance of FineGrain_PMC by only 3.5%

29 29 Comparison to Big-Little (OOO-InOrder): IPS 2 /Watt

30 30 Is adding an InOrder core necessary?

31 31 Impact of self-morphing on core reliability Raw SER – Soft Error Rate independent of design details Common reliability metric: Vulnerability to soft errors or Soft Error Rate (SER) AVF – Architectural Vulnerability Factor

32 32 Performance/Watt vs SER tradeoff Minimize RPE – Reliability Power_Efficiency  SER is the lowest for the SM core  Tradeoff between SER and IPS 2 /Watt  Use a Cobb-Douglas type production function a+b=1

33 33 Decrease in SER, increase in IPS/Watt and IPC

34 34 Conclusions  Thread migration/core-hopping is expensive Performed at coarse grain  In-core morphing preserves state & cache Minimal overhead Allowing frequent morphing - larger gains  Achieving average improvement in IPS 2 /Watt of 36%, power savings of 33% and performance increase of 16%  If reliability is also a design objective in-core morphing can achieve 11% SER reduction for a lower (24%) IPS/Watt improvement and a lower (6%) performance increase

35 35 Backup Slides

36 36 Error in estimation of AVF based on PMCs PMCs AC PMCs NC PMCs LW PMCs SM

37 37 Counter selection heuristic Input: PMCs & Power/IPC trace (of representative workloads) Objective: Minimum no. of PMCs to adequately fit power and IPC Metric: Correlation coefficient, R 2, of the fit (higher the better)  Approach: − Search counter space (14) iteratively − Each iteration: Choose a new counter that best fits IPC/Power trace along with previously selected counters − Plot R 2 coefficient − Best set of counters in region where R 2 saturates

38 38 High Branch mispredicition rate Execution of astar – SM preferred

39 39 Execution of mcf – NC preferred Memory-bound application w/high L 2 miss rate

40 40 CPU-bound applications w/high resource use Execution of bzip2 – LW preferred

41 41 Power Constrained core designs  Core types for a 2W power constraint:  Core types for a 1.5W power constraint:

42 42 Comparing Power Constrained and Un-Constrained cores Considerable energy saving vs. AC core type Unconstrained

43 43 Error distribution Distribution of error in estimating IPC in various core types using PMCs of narrow core (NC) Deviation of errors from mean is low with up to 80% between +/- 10% from the mean

44 44 Number of switches in self-morphing scheme Average number of switches is 1250 in 100M instructions Every 2K instructions there is a 25% probability to perform a mode switch

45 45 IPC of switching schemes (with AC as baseline) Unconstrained

46 46 Morphable architecture With/Without InO core 3-mode morphing scheme with the inclusion of InO provides only 2% extra IPS 2 /Watt benefit compared to the 2-mode morphing between OOO(AC)-OOO(SM)

47 47 Mediabench/MyBench

48 48 (2) Is InO mode necessary? InO core has smaller cache and array structures with much lower static power Cache/Array leakage is no longer a problem as tri-gates cut leakage by 10X at 22nm Use instead a small OOO: Fetch, issue width of 1 and smaller ROB, LSQ and IQ For most benchmarks IPC/Watt of InO and small OOO are comparable Simulation with MCPAT 22nm tri-gate models

49 49 Effective Soft Error Rate (SER) Reduction

50 50 Microarchitecture of Morphable Core  IQ, ROB, LSQ are resized dynamically when morphing from one core type to another  ROB, LSQ and IQ are implemented as banked structures  Resizing involves turning on/off banks  Reduce/increase fetch width, Power-off/on half the decoders


Download ppt "Dept. of Electrical & Computer Engineering Self-Morphing Cores for Higher Power Efficiency and Improved Resilience Nithesh Kurella, Sudarshan Srinivasan."

Similar presentations


Ads by Google