What computer architects need to know about memory throttling
WEED 2010, June 20, 2010
Heather Hanson, Karthick Rajamani
IBM Research – Austin
© 2010 IBM Corporation

Outline
– Memory throttling overview
– Experimental platform
  – System configuration
  – Memory throttling implementation
– Memory throttling characterization
  – Bandwidth
  – Power
  – Performance
– Summary

Memory throttling in a nutshell
Memory throttling is a power-performance knob that:
– impacts memory reference rates of both instruction and data streams
– controls power
– can be used for safety or optimization: regulating DIMM temperatures, enforcing memory power budgets
Memory throttling restricts read & write traffic:
– directly controls memory power
– indirectly affects processors and other components
Several implementation styles exist in commercial systems:
– insert periodic idle cycles
– allow an arbitrary number of transactions up to an (estimated) power threshold
– run + hold windows
– enforce read & write quotas [this paper]: the first N transactions proceed within a time window; any further requests wait until the next time period (a sketch of this quota scheme follows below)
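As a rough illustration of the quota style used in this paper, the sketch below models a controller that admits at most N requests per M-cycle window and defers the rest to the next window. The arrival times, N, and M are made-up values, and the model only tracks when requests become eligible for service, not the actual bus scheduling.

```c
/* Minimal sketch of quota-style memory throttling (not IBM's implementation):
 * at most N requests become eligible per M-cycle window; later arrivals wait
 * for the next window.  Arrival times, N, and M are hypothetical. */
#include <stdio.h>

#define N 6    /* quota: requests admitted per window        */
#define M 32   /* window length in memory-controller cycles  */

int main(void)
{
    /* hypothetical arrival cycles of 12 memory requests, in order */
    int arrival[] = {0, 1, 2, 3, 4, 5, 6, 7, 40, 41, 42, 70};
    int n_req = sizeof arrival / sizeof arrival[0];

    int window_start = 0;   /* first cycle of the current window        */
    int used = 0;           /* requests already admitted in this window */

    for (int i = 0; i < n_req; i++) {
        int t = arrival[i];
        /* advance to the first window that can admit this request */
        while (t >= window_start + M || used >= N) {
            window_start += M;
            used = 0;
        }
        int eligible = (t > window_start) ? t : window_start;
        used++;
        printf("request %2d: arrives %3d, eligible %3d%s\n",
               i, t, eligible, eligible > t ? "  (throttled)" : "");
    }
    return 0;
}
```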

Comparison to clock throttling
– Run-hold clock throttling: the clock runs at its regular frequency during the run portion and is halted during the hold portion.
– Quota-style memory throttling: reads & writes proceed as requested, up to N requests per period.
Example: N = 6. Up to 6 transactions are serviced per period, regardless of request timing; once the Nth request in a period has been serviced, additional requests are queued for service in a later period.
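A back-of-the-envelope comparison of the two knobs, using made-up run/hold and quota settings: run-hold throttling scales the throttled unit's activity by its duty cycle, while a quota only caps how many memory transactions each window may admit and leaves their timing within the window unconstrained.

```c
/* Back-of-the-envelope comparison of run-hold clock throttling and
 * quota-style memory throttling.  All settings below are made up. */
#include <stdio.h>

int main(void)
{
    double run = 24.0, hold = 8.0;      /* assumed run-hold cycles per window     */
    double quota = 6.0, window = 32.0;  /* assumed requests / cycles per window   */
    double peak_rate = 0.25;            /* assumed unthrottled requests per cycle */

    double duty = run / (run + hold);            /* fraction of time active   */
    double cap  = quota / (window * peak_rate);  /* fraction of peak traffic  */

    printf("run-hold : active %.0f%% of the time (activity scales with duty cycle)\n",
           duty * 100);
    printf("quota    : memory traffic capped at %.0f%% of peak; request timing "
           "within each window is unconstrained\n", cap * 100);
    return 0;
}
```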

POWER6 memory throttling
IBM JS12 blade system
– Processor: POWER6, 1 socket, 2 cores per processor socket, 3.8 GHz frequency (fixed in these experiments), SLES10 Linux
– Memory: 16 GB capacity, 8 DIMMs of 2 GB each, DDR2 667 MHz bus
Quota-style memory throttling
– N transactions per M memory cycles
– 100% throttle level == unthrottled
– The throttle time period is much shorter than thermal and power-supply timescales
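To give a feel for what an N-per-M quota means in bandwidth terms, here is an illustrative conversion from a quota setting to a bandwidth cap. The channel width, transaction size, bus clock, and the N and M values are all assumptions chosen for the arithmetic, not the JS12's actual memory-controller parameters.

```c
/* Illustrative conversion from a quota setting to a bandwidth cap.
 * Every parameter below is an assumption for the sake of the arithmetic,
 * not the JS12's actual memory-controller configuration. */
#include <stdio.h>

int main(void)
{
    double transfers_per_sec = 667e6;  /* DDR2-667: 667 MT/s                  */
    double bytes_per_transfer = 8.0;   /* assumed 64-bit data bus             */
    double line_bytes = 128.0;         /* assumed bytes moved per transaction */
    double bus_hz = 333e6;             /* assumed memory bus clock            */

    double peak_bw = transfers_per_sec * bytes_per_transfer;   /* bytes/s        */
    double peak_lines = peak_bw / line_bytes;                   /* transactions/s */

    double N = 8.0, M = 128.0;         /* assumed quota: N transactions per M cycles */
    double cap_lines = N / M * bus_hz;
    if (cap_lines > peak_lines)
        cap_lines = peak_lines;        /* cannot exceed the channel's peak */

    printf("peak bandwidth : %.2f GB/s\n", peak_bw / 1e9);
    printf("throttle cap   : %.2f GB/s (%.0f%% of peak)\n",
           cap_lines * line_bytes / 1e9, 100.0 * cap_lines / peak_lines);
    return 0;
}
```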

Memory throttle characterization methodology
1. Sweep throttle settings
– Set the throttle
– Run a steady-behavior benchmark (sketched below): DAXPY (double A * X plus Y), FPMAC (floating-point multiply-accumulate), RandomMemory (generates random addresses), SPECpower_ssj2008 calibration phase (peak throughput for warehouse transactions)
– Record sensor data, 256 ms per sample: memory power, memory reads & writes, instruction throughput, and other sensors not shown here
– Decrement the throttle
– Repeat for the full range of throttle settings
2. Repeat the throttle sweep for multiple benchmarks and memory footprints
– Microbenchmarks: L1-cache-contained and main-memory footprints
– SPECpower_ssj2008: behaves as nearly contained in on-chip caches
3. Calculate the median sensor value for each {benchmark, footprint, throttle} permutation
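For concreteness, here are minimal sketches of two of the microbenchmark kernels named above. The actual IBM microbenchmarks, footprint sizes, and iteration counts are not given in the slides, so everything below (array sizes, the xorshift address generator, the 64-byte stride) is a plausible stand-in rather than the code used in the study.

```c
/* Minimal sketches of DAXPY-like and RandomMemory-like kernels.
 * Footprints, iteration counts, and the address generator are placeholders,
 * not the IBM microbenchmarks used in the paper. */
#include <stdlib.h>

/* DAXPY: y = a*x + y; streams two arrays through the memory hierarchy */
static void daxpy(double a, const double *x, double *y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* RandomMemory: touch cache lines at pseudo-random offsets to defeat
 * prefetching and keep the DRAM channel busy with demand misses */
static long random_memory(const char *buf, size_t bytes, size_t touches)
{
    long sum = 0;
    unsigned long long state = 88172645463325252ULL;   /* xorshift64 seed */
    for (size_t i = 0; i < touches; i++) {
        state ^= state << 13; state ^= state >> 7; state ^= state << 17;
        sum += buf[(state % (bytes / 64)) * 64];        /* one line per touch */
    }
    return sum;
}

int main(void)
{
    /* footprint chosen to exceed on-chip caches so traffic reaches the DIMMs */
    size_t n = (size_t)1 << 25;                          /* 32 Mi doubles per array */
    double *x = calloc(n, sizeof *x), *y = calloc(n, sizeof *y);
    if (!x || !y) return 1;

    daxpy(3.0, x, y, n);
    long r = random_memory((const char *)x, n * sizeof *x, (size_t)1 << 24);

    volatile double sink = y[0] + (double)r;             /* keep results live */
    (void)sink;
    free(x); free(y);
    return 0;
}
```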

Memory throttle effect on bandwidth [chart]: the curve shows a linear region, a saturated region, and a transition between the linear & saturated regions.
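A first-order way to think about the two regions, as a hedged sketch: achieved bandwidth is roughly the smaller of the workload's demand and the cap implied by the throttle setting. The transition region is exactly where this idealization breaks down, as the next slide discusses.

```c
/* First-order model of the bandwidth curve: achieved bandwidth is roughly
 * min(demand, cap).  The measured transition region deviates from this
 * idealization because throttling also suppresses the request rate itself. */
#include <stdio.h>

static double achieved_bw(double demand_gbs, double throttle_frac, double peak_gbs)
{
    double cap = throttle_frac * peak_gbs;       /* bandwidth allowed by the quota */
    return demand_gbs < cap ? demand_gbs : cap;  /* linear vs. saturated region    */
}

int main(void)
{
    /* hypothetical 5.3 GB/s peak channel and a workload demanding 4.0 GB/s */
    for (double t = 0.1; t <= 1.0001; t += 0.1)
        printf("throttle %3.0f%% -> %.2f GB/s\n", t * 100, achieved_bw(4.0, t, 5.3));
    return 0;
}
```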

A closer look at RandomMemory-DIMM
– RandomMemory with a DIMM-sized footprint uses less bandwidth than the other benchmarks at the same throttle levels, and also less bandwidth than its own saturation level.
– Simply measuring bandwidth at a single (current) throttle level is not enough to identify the region of operation: bandwidth below the maximum could mean either the saturated region or the transition region.
– As a result, a controller cannot accurately predict the effect of a throttle-level change on bandwidth, nor its effect on power or performance.
– Subtle but very important point about the transition region: actual bandwidth falls below the maximum because bandwidth restrictions cause pipeline starvation, which in turn reduces the request rate.

Memory power vs. throttle level [chart]: memory power is basically linear in bandwidth, so this chart looks much like the bandwidth chart.

Throttling effects relative to each benchmark [power and performance charts]
– L1-contained DAXPY: throttling has no effect; DIMM-sized DAXPY: drastic effect.
– Generally there is more performance reduction than power reduction (in percent):
  – throttling alone doesn't affect the static portion of memory power
  – leveraging idle low-power memory modes can positively alter the power-performance curve of memory request-rate throttling
  – it is possible to waste energy because of the longer execution time (worked example below)
– A larger bandwidth demand means a larger effect from throttling; conversely, power is reduced only when performance is impacted.
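A small worked example of the energy caveat, using hypothetical numbers: if a throttle setting cuts memory power by 10% but costs 20% of performance, the run takes 25% longer and total memory energy rises by about 12.5%.

```c
/* Worked example (hypothetical numbers) of how throttling can increase energy:
 * a 10% power reduction paired with a 20% performance loss lengthens execution
 * enough that total energy, E = P * t, goes up. */
#include <stdio.h>

int main(void)
{
    double power_rel = 0.90;   /* power relative to unthrottled (assumed -10%)       */
    double perf_rel  = 0.80;   /* performance relative to unthrottled (assumed -20%) */

    double time_rel   = 1.0 / perf_rel;        /* same work takes longer: 1.25x */
    double energy_rel = power_rel * time_rel;  /* E = P * t: 1.125x (+12.5%)    */

    printf("relative execution time: %.3f\n", time_rel);
    printf("relative energy:         %.3f\n", energy_rel);
    return 0;
}
```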

Summary
– Memory throttling is a power-performance knob available in commercial systems.
– The memory controller restricts read & write bandwidth, which caps memory power and controls DIMM temperature.
– Mileage may vary: the power and performance effects depend on bandwidth demand.
  – Throttling a low-bandwidth workload doesn't reduce much power.
  – Throttling can use more energy overall due to the increased execution time, so use highly throttled settings with caution.
– Memory throttling is an effective tool for power capping: power-constrained configurations, thermal safety, power shifting.

Acknowledgements
IBM Research – Austin and IBM Systems & Technology Group
– Memory characterization: Joab Henderson, Kenneth Wright
– EnergyScale firmware: Guillermo Silva, Andrew Geissler

Throughput: instructions per second [chart]
– Performance normalized to unthrottled, per benchmark.
– L1-contained DAXPY: throttling has no effect; DIMM-sized DAXPY: drastic effect.

Memory power [chart]: normalized to per-benchmark maximum.