A Multi-Core Approach to Addressing the Energy-Complexity Problem in Microprocessors
Rakesh Kumar, Keith Farkas (HP Labs), Norman Jouppi (HP Labs), Partha Ranganathan (HP Labs), Dean Tullsen (UCSD)

Motivation
- Power is an important issue for processors
- Power goes up with every successive generation (along with complexity)
  - Up to 150W for the Alpha 21464!

Past Techniques for Power Reduction
- Voltage/frequency scaling
  Limitation: limited by technology; also, not possible below a certain feature size.
- Architectural adaptation
  - shut off portions of the core when not needed
  - dynamic speculation control
  - reconfigurable caches
  Limitations:
  - very few choices to make
  - only dynamic power is saved
  - has associated overhead

Single-ISA Heterogeneous Multi-Core Architectures

Our Proposal
- Have multiple heterogeneous cores on the same die
- Match each workload (or workload phase) to the core that achieves the best efficiency according to some objective function
  (ensure that the new core has acceptable performance)
- Power down the unused cores

Motivation
- Hypotheses
  - The performance difference between cores varies with the workload or workload phase
  - Different cores have varying relative energy efficiencies for the same workload
- Implication: the “best” core can change dynamically as a workload runs

Goals of the Paper
- Validate the hypotheses
- Get an idea of the design space
- Get an idea of the potential benefits

Outline of Talk
- Motivation
- Past Work
- Our Work
  - Assumptions
  - Decisions
  - Methodology
- Results and Conclusions
- Summary and Future Work

Choice of Cores on the Die
Five cores on the die:
- In-order: QED R4700, EV4 (Alpha 21064), EV5 (Alpha 21164)
- Out-of-order: EV6 (Alpha 21264), "EV8-"
All cores are assumed to be without L2 cache.
"EV8-": issue width is the same as EV8 (Alpha 21464)
- Resources reduced to account for a single thread
- Core power dissipation: 100W

Properties of the Cores
Processor      R4700        EV4         EV5         EV6              EV8-
Issue-width    1            2           4           6 (OOO)          8 (OOO)
I-Cache        2-way 16KB   DM, 8KB     DM, 8KB     2-way 64KB       4-way 64KB
D-Cache        2-way 16KB   DM, 8KB     DM, 8KB     2-way 64KB       4-way 64KB
Branch Pred.   No           2KB/1-bit   2K-gshare   Hybrid 2-level   Hybrid 2-level
MSHR           1            2           4           8                16
Notice the gradation!

Properties of Cores (contd.)
- Assume all cores are implemented in 0.1um
  - Area and power scaled accordingly
- Clock speed?
  - All Alpha cores assumed to run at 2.1GHz (EV6 frequency at 0.1um)
  - R4700 assumed to run at 1GHz

Core Power and Area
- Peak power of each core estimated from data sheets
  - minus the power used by L2 caches and pins
  - then scaled for the 0.1um process
- Area of each core estimated from die photos
  - minus the area of I/O pads, wires, L2 cache and control
  - then scaled for the 0.1um process
- L2 cache area and power
  - estimated using CACTI

Core Power and Area (contd.)
Processor   Core power (W)   Core area (mm^2)
R4700        0.45               3
EV4          4.97               3
EV5          9.83               5
EV6         17.80              24
EV8-        92.88             260
EV8- consumes about 200 times the power of the R4700! It is more than 85 times bigger too!
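The two ratios quoted on this slide can be checked directly from the table. The snippet below is only a small arithmetic sketch; the dictionary and the function name `ratio_to` are illustrative and not part of the original talk.

```python
# Core power (W) and core area (mm^2), copied from the table on this slide.
CORES = {
    "R4700": (0.45, 3),
    "EV4": (4.97, 3),
    "EV5": (9.83, 5),
    "EV6": (17.80, 24),
    "EV8-": (92.88, 260),
}


def ratio_to(baseline: str, core: str) -> tuple[float, float]:
    """Return (power ratio, area ratio) of `core` relative to `baseline`."""
    base_power, base_area = CORES[baseline]
    power, area = CORES[core]
    return power / base_power, area / base_area


if __name__ == "__main__":
    power_ratio, area_ratio = ratio_to("R4700", "EV8-")
    print(f"EV8- vs R4700: {power_ratio:.1f}x power, {area_ratio:.1f}x area")
```

The power ratio comes out a little above 200 and the area ratio a little above 85, in line with the slide's remarks.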

Core Power and Area (contd.)

Methodology
- Simulator used: SMTSIM
- ROB size, active-list size and load-store queue always kept big enough to ensure no conflicts
- Benchmarks used: 14 chosen randomly out of the SPEC2000 suite
- Fast-forwarded for 2 billion instructions, then simulated for 1 billion instructions
- Data collected after every 1 million instructions

Validating Hypotheses
- The performance difference between cores varies with the workload or workload phase (metric: IPS)
- Different cores have varying relative energy efficiencies for the same workload (metric: IPS/W)
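A minimal sketch of how the two metrics could be computed for one sampling interval. The cycle counts are made up for illustration, and using the peak core power from the earlier table as the power figure is an assumption of this sketch, not something the talk specifies. The function names are likewise hypothetical.

```python
def ips(instructions: int, cycles: int, clock_hz: float) -> float:
    """Instructions per second over one sampling interval."""
    return instructions * clock_hz / cycles


def ips_per_watt(ips_value: float, power_w: float) -> float:
    """Energy-efficiency metric: instructions per second per watt."""
    return ips_value / power_w


if __name__ == "__main__":
    INTERVAL = 1_000_000  # instructions per sample, as in the methodology
    CLOCK_HZ = 2.1e9      # Alpha core clock assumed in the talk
    # core -> (hypothetical cycle count for the interval, core power in W)
    samples = {"EV5": (1_400_000, 9.83), "EV6": (1_000_000, 17.80)}
    for name, (cycles, power_w) in samples.items():
        rate = ips(INTERVAL, cycles, CLOCK_HZ)
        print(f"{name}: {rate:.3g} IPS, {ips_per_watt(rate, power_w):.3g} IPS/W")
```

With these invented cycle counts the EV6 has the higher IPS, yet the EV5 has the higher IPS/W, because its power is so much lower.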

Performance Variation with Time
Ah! Those clear, distinct phases!

Variation of Energy Efficiency with Time
Power dominates the IPS/W numbers!

How does a composite objective function fare?

Energy-delay Product Profile
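For reference, the energy-delay product of a sampling interval can be written in terms of the quantities used earlier in the talk. This is a sketch under the assumption that each interval contains a fixed instruction count $N$ and that a core draws an average power $P$ over the interval; the symbols are introduced here for illustration.

```latex
% Delay and energy for N instructions on a core with throughput IPS and power P
D = \frac{N}{\mathrm{IPS}}, \qquad E = P \cdot D = \frac{P\,N}{\mathrm{IPS}}

% Energy-delay product for the interval
E \cdot D = P \cdot D^{2} = \frac{P\,N^{2}}{\mathrm{IPS}^{2}}

% Comparing two cores a and b over the same N instructions, N cancels:
\frac{(E \cdot D)_a}{(E \cdot D)_b} = \frac{P_a}{P_b}\left(\frac{\mathrm{IPS}_b}{\mathrm{IPS}_a}\right)^{2}
```

Because delay enters squared, the energy-delay product weighs performance more heavily than IPS/W does, which is why it serves as a composite objective.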

So why not run on the “best” core at all times?

Choosing Dynamically the Core with Best Energy-Delay Product (perf. loss < 50%)
Notice the regions where the best path is not along the best energy-delay product!
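The selection rule on this slide can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: performance loss is measured against a reference core chosen by the caller, the energy-delay product is taken as proportional to P / IPS^2 for a fixed instruction count, switching overhead is ignored, and all names and sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    ips: float      # instructions per second in this interval
    power_w: float  # core power in watts


def energy_delay(sample: Sample) -> float:
    """Relative energy-delay product for a fixed instruction count (P / IPS^2)."""
    return sample.power_w / sample.ips ** 2


def pick_core(interval: dict[str, Sample], reference: str, max_loss: float = 0.5) -> str:
    """Pick the lowest energy-delay core whose loss vs. `reference` is below `max_loss`."""
    floor = interval[reference].ips * (1.0 - max_loss)
    eligible = {name: s for name, s in interval.items() if s.ips > floor}
    return min(eligible, key=lambda name: energy_delay(eligible[name]))


def schedule(trace: list[dict[str, Sample]], reference: str, max_loss: float = 0.5):
    """Return the per-interval choices and the number of core switches."""
    choices = [pick_core(interval, reference, max_loss) for interval in trace]
    switches = sum(1 for prev, cur in zip(choices, choices[1:]) if prev != cur)
    return choices, switches


if __name__ == "__main__":
    # Hypothetical two-phase trace; power values come from the core table.
    fast_phase = {
        "EV8-": Sample(2.0e9, 92.88),
        "EV6": Sample(1.2e9, 17.80),
        "EV5": Sample(0.95e9, 9.83),  # best energy-delay, but loss exceeds 50%
    }
    slow_phase = {
        "EV8-": Sample(1.0e9, 92.88),
        "EV6": Sample(0.9e9, 17.80),
        "EV5": Sample(0.8e9, 9.83),
    }
    print(schedule([fast_phase, slow_phase, fast_phase], reference="EV8-"))
```

In the made-up fast phase, the EV5 has the lowest energy-delay product but falls below the performance floor, so the EV6 is chosen instead. That is one way the chosen path can depart from the best energy-delay curve.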

Choosing Dynamically the Core with Best Energy-Delay Product (perf. loss < 50%) [Summary of Results]
           Energy-Delay Savings (%)   Performance Degradation (%)
Maximum    97.9                       8.5
Minimum    0.1
Mean       65.4                       18.2
Number of switches: maximum = 387 (art), minimum = 0, median = 1

Dissecting the Results
- More improvements are possible
  - locally best decisions are not necessarily globally best
  - there was a performance constraint
  - the choice of cores is not the best for this objective function
  - the cache configurations are not necessarily the best
- Even with the present improvements, beats voltage scaling handsomely (44.2% ED^2 improvement)
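A note on why ED^2 is a natural yardstick against voltage scaling. The following is a standard idealized argument, not taken from the talk: it assumes switching energy scales with the square of the supply voltage $V$, that clock frequency scales linearly with $V$, and it ignores leakage and threshold-voltage effects.

```latex
% Idealized voltage scaling
E \propto V^{2}, \qquad D \propto \frac{1}{f} \propto \frac{1}{V}

% Hence the energy-delay-squared product does not depend on V
E \cdot D^{2} \propto V^{2} \cdot \frac{1}{V^{2}} = \text{constant}
```

Under these assumptions voltage scaling alone leaves ED^2 roughly unchanged, so an improvement in ED^2 reflects a gain that voltage scaling by itself would not deliver.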

Conclusion
- Enormous potential for power savings
- No leakage-power solution
- Does considerable IP reuse
- Complexity-appropriate: every application is matched to the core of “appropriate” complexity

Tip of the iceberg? Current/Future Work
- Cores can be non-ordered
- Some cores can be multithreaded
- Throughput impact of the architecture

Questions?
