Power Management for Memory Systems
Ming Chen
Nov. 10th, 2009
ECE 692 Topic Presentation

Why Power Control for Main Memory?
• Memory capacities have been increasing significantly to accommodate CMPs.
−The CPU is no longer the only major power consumer.
• Memory is highly under-utilized, both in:
−Requirement on amount
−Requirement on bandwidth
[Figure: utilization results (85.4%, 92.2%) comparing CPU and memory power.]

Limiting the Power Consumption of Main Memory
Bruno Diniz, Dorgival Guedes, Wagner Meira Jr. (Federal University of Minas Gerais, Brazil)
Ricardo Bianchini (Rutgers University, USA)
Acknowledgments: the organization and contents of some slides are based on Ricardo Bianchini's slides.

Power Saving vs. Power Control
• A problem with two sides:
−A trade-off between power and performance.
• Power saving:
−Guarantee performance first, then minimize power.
−Performance is primary.
−Saves on the energy bill.
• Power control:
−Power capping: cooling, thermal, packaging, etc.
−Guarantee the power budget first, then maximize performance.
−The power budget is primary.
−Avoids system failures and thermal violations.

What is This Paper About?
• A large body of prior work addresses power saving.
• This is the first paper I have read on power control for memory.
• It proposes four policies for Power Limiting (PL) in memory:
−Knapsack, LRU-Greedy, LRU-Smooth, LRU-Ordered.
• It combines Power Limiting with Energy Conserving (PL-EC).
• It also provides a performance guarantee (PL-EC-Perf).
An interesting paper that combines the two sides of the power problem.

Power Actuators
• RDRAM systems.
−RDRAM is DRAM, but not DDR SDRAM.
• Each chip can be transitioned independently.
• Each chip supports several different power states.

Power Limiting
[Diagram: a memory controller managing four chips, each in a power state (A = Active, S = Standby, N = Nap).]

Power Limiting
[Diagram: an access arrives at a chip that is currently in a low-power state.]

Power Limiting
[Diagram: the accessed chip is brought to the Active state.]

Adjusting Power States
[Diagram: to stay within the power budget, the controller demotes another chip to a lower-power state.]

Power Limiting
[Diagram: a new access arrives while another chip is active.]
Different approaches exist for adjusting the states; the following slides present them.

Knapsack: Key Idea
• Multi-Choice Knapsack Problem (MCKP):
−Object: a memory device.
−Choices: its possible power states.
−Weight: the power consumption.
−Cost: the overhead of transitioning back to the active state.
• Goal: minimize the total cost, subject to the weight (power budget) constraint, by placing each object in exactly one state.
• MCKP is NP-hard, so it is solved off-line (a brute-force sketch follows).
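
As an illustration of the off-line step, here is a minimal brute-force MCKP sketch in Python. The state powers and wake-up costs are made-up placeholders, not the paper's RDRAM parameters; since the paper solves this problem off-line, exhaustive search is tolerable for small device counts.

    from itertools import product

    # Hypothetical power states: (power in watts, overhead of returning
    # to Active). Illustrative values only, not the paper's parameters.
    STATES = {"active":    (0.300, 0.0),
              "standby":   (0.180, 1e-6),
              "nap":       (0.030, 1e-4),
              "powerdown": (0.005, 1e-2)}

    def knapsack_config(n_devices, power_budget):
        """Assign one power state per device (the MCKP 'choice') so that
        the total power (the 'weight') stays under the budget while the
        total transition overhead (the 'cost') is minimized."""
        best_cost, best = float("inf"), None
        for assign in product(STATES, repeat=n_devices):
            power = sum(STATES[s][0] for s in assign)
            cost = sum(STATES[s][1] for s in assign)
            if power <= power_budget and cost < best_cost:
                best_cost, best = cost, assign
        return best

    print(knapsack_config(4, 0.7))
    # -> e.g. ('active', 'standby', 'standby', 'nap') under these toy numbers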

Knapsack: An Example
[Table: chip # vs. power state, from the off-line knapsack solution.]
• An LRU queue is maintained for the active devices.
• The LRU device is the victim.
• The accessed chip and the victim switch their power states.

LRU-Greedy
• A single LRU queue for all devices.
• When a device is about to be accessed, move it to the tail of the queue; then:
−Active? Proceed with the access.
−Not active? Activate it, then push the LRU device to progressively deeper states. If even its deepest state does not bring power under budget, adjust the state of the next device in the queue.

LRU-Smooth
• A single LRU queue for all devices.
• When a device is about to be accessed, move it to the tail of the queue; then:
−Active? Proceed with the access.
−Not active? Activate it, then put the LRU device into its next lower-power state. If that is not enough, do the same for the next device, spreading the degradation across devices (see the sketch below).
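
A sketch of the demotion loop both LRU policies run when activating a device would exceed the budget, again with toy state powers rather than the paper's values. The only difference between the two heuristics is how far each victim is pushed before the next one is touched:

    # Illustrative per-state power (W); not the paper's values.
    POWER = {"active": 0.30, "standby": 0.18, "nap": 0.03, "powerdown": 0.005}
    DEEPER = {"active": "standby", "standby": "nap", "nap": "powerdown"}

    def free_power(lru_queue, state, needed, smooth):
        """Demote devices starting from the LRU end until `needed` watts
        are freed. LRU-Greedy (smooth=False) drives one device all the
        way down before touching the next; LRU-Smooth (smooth=True)
        demotes each device a single step, spreading the degradation."""
        freed = 0.0
        for dev in lru_queue:                  # least-recently-used first
            while freed < needed and state[dev] in DEEPER:
                nxt = DEEPER[state[dev]]
                freed += POWER[state[dev]] - POWER[nxt]
                state[dev] = nxt
                if smooth:
                    break                      # one step, then next device
            if freed >= needed:
                break
        return freed

Note that for LRU-Smooth a single pass may not free enough power; the actual policy keeps cycling through the queue, which would need an outer loop here.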

LRU-Ordered: Key Idea
• An LRU queue for the active devices.
• An ordered queue for the devices in low-power modes (shallowest state first).
• When a device is about to be accessed:
−Active? Move it to the tail of the LRU queue and proceed.
−Not active? Move it from the ordered queue to the tail of the LRU queue, and send the LRU active device to the top of the ordered queue. If the budget is still violated, move the device at the head of the ordered queue to its next lower-power state (a sketch follows).
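
A condensed sketch of LRU-Ordered, assuming the same toy state powers as above and ignoring transition latencies. The `low` list stands in for the paper's ordered queue; rather than keeping it sorted, the sketch simply recomputes the shallowest device each time:

    from collections import deque

    POWER = {"active": 0.30, "standby": 0.18, "nap": 0.03, "powerdown": 0.005}
    DEEPER = {"active": "standby", "standby": "nap", "nap": "powerdown"}

    class LruOrdered:
        def __init__(self, devices, budget):
            self.budget = budget
            self.lru = deque(devices)       # active devices, LRU at the left
            self.low = []                   # devices in low-power states
            self.state = {d: "active" for d in devices}

        def total_power(self):
            return sum(POWER[s] for s in self.state.values())

        def access(self, dev):
            if self.state[dev] == "active":
                self.lru.remove(dev)        # refresh its recency
                self.lru.append(dev)
                return
            self.low.remove(dev)            # activate the accessed device
            self.state[dev] = "active"
            self.lru.append(dev)
            if len(self.lru) > 1:           # demote the LRU active device
                victim = self.lru.popleft()
                self.state[victim] = "standby"
                self.low.append(victim)
            # Still over budget? Deepen the shallowest low-power device,
            # which is what the head of the ordered queue would hold.
            while self.total_power() > self.budget:
                cand = [d for d in self.low if self.state[d] in DEEPER]
                if not cand:
                    break                   # everything is already deepest
                head = max(cand, key=lambda d: POWER[self.state[d]])
                self.state[head] = DEEPER[self.state[head]]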

LRU-Ordered: An Example
[Diagram: the LRU and ordered queues evolving as devices are accessed.]

Energy Conservation (PL-EC)
• If the idle time in the current state exceeds the break-even time, move to the next lower-power state (see the sketch below).
• Knapsack is used off-line to minimize delay × power² for each possible number of active devices.
• The number of active devices changes whenever:
−An active device is transitioned to the next lower-power state because its idleness threshold expired, or
−A low-power device is transitioned to the active state and doing so does not violate the budget.
• The memory controller looks up the precomputed table and adjusts the states.
• If activating the device would violate the budget, the basic scheme (PL) is used.
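
The threshold test is the standard break-even rule for state demotion. A small sketch, with placeholder break-even times (the real values follow from the measured transition energies):

    # Placeholder break-even times (s) before demoting out of each state.
    BREAK_EVEN = {"active": 1e-3, "standby": 1e-2, "nap": 1e-1}
    DEEPER = {"active": "standby", "standby": "nap", "nap": "powerdown"}

    def maybe_demote(state, idle_time):
        """Send a device one state deeper once it has idled past the
        break-even time, i.e. once the energy saved in the deeper state
        outweighs the cost of the extra wake-up transition."""
        if state in BREAK_EVEN and idle_time > BREAK_EVEN[state]:
            return DEEPER[state]
        return state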

Performance Guarantee (PL-EC-Perf)
• To what extent should energy be saved?
• The basic strategy comes from Xiaodong Li's ASPLOS'04 paper:
−5M-cycle epochs.
−A user-defined slowdown (3%) relative to PL.
−Compute the slack at runtime (see the sketch below).
−If the slack drops below 0, disable EC until the end of the epoch.
• Disabling EC means reverting to the corresponding PL policy.
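
A minimal sketch of the slack bookkeeping, assuming the epoch length and slowdown bound from the slide; `baseline_cycles` would come from tracking what the PL-only policy would have taken:

    EPOCH_CYCLES = 5_000_000   # epoch length
    SLOWDOWN = 0.03            # user-defined allowable slowdown vs. PL

    def slack(baseline_cycles, actual_cycles):
        """Slack is the extra delay EC may still add in this epoch:
        positive while execution stays within (1 + 3%) of the PL-only
        time, negative once the guarantee is at risk."""
        return baseline_cycles * (1 + SLOWDOWN) - actual_cycles

    # If slack(...) < 0, disable energy conservation (i.e. fall back to
    # the plain PL policy) until the current epoch ends.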

Evaluation Methodology
• Single-core, in-order CPU with an integrated memory controller.
• Simics plus a simulated memory subsystem.
• The OS and the physical mapping of virtual pages are both simulated.
• The memory system is driven by traces generated by Simics.
• Workloads: MediaBench, SPEC 2000, and client-server applications.
• Memory size: 512 MB.
• Performance is measured as the execution time of a trace file.

Performance vs. PL Policies
• Knapsack and LRU-Ordered perform best.
• Setup: 8 chips, 50% power budget.

Energy vs. Policies
• Energy is compared with unrestricted execution.
• The unrestricted power consumption is already below the budget.
• bzip behaves anomalously (revisited in the Critiques).
• Setup: 8 chips, 50% power budget.

Performance vs. Power Budget
• 8 chips under LRU-Ordered.
• Performance degradation is very small.

Energy vs. Power Budget
• Savings decrease as the budget decreases.
• The trend is uniform across all workloads.
• Setup: 8 chips, 50% power budget.

Performance for PL-EC-Perf
• Setup: 8 chips, 25% power budget, LRU-Ordered, 3% allowed slowdown.
• PD: an explicit energy-saving algorithm, used as the baseline.
• PL/PL-EC-Perf performs no worse than PD in almost all cases.
• The exception is bzip2, which is left unexplained.

Energy Saving for PL/PL-EC-Perf
• Setup: 8 chips, 25% power budget, LRU-Ordered, 3% allowed slowdown.
• PD: an explicit energy-saving algorithm, used as the baseline.
• PL/PL-EC-Perf saves more energy than PD.
• PD tends to send some chips to very deep states.

Conclusions
• Four power-limiting policies are proposed (PL).
• Performance degradation is surprisingly low.
• Power limiting is combined with energy conservation (PL-EC).
• A performance guarantee is added on top (PL-EC-Perf).
• Limiting power consumption is as effective as doing energy conservation explicitly.

A Performance-Conserving Approach for Reducing Peak Power Consumption in Server Systems
Wes Felter, Karthick Rajamani, Tom Keller (IBM Austin Research Lab)
Cosmin Rusu (University of Pittsburgh)
Acknowledgments: the organization and contents of some slides are based on Wes Felter's slides.

Motivations
• System designers can no longer afford to provision for the peak power of every component simultaneously (over-provisioning).
• Power overloads and thermal violations cause system failures.
• The CPU is no longer the only major power consumer.
• The CPU and the main memory share the same power and cooling facilities.

Anti-Correlation of Processor and Memory Power
• The processor and memory are not simultaneously highly utilized by real workloads.
• Intuitively, the processor cannot keep both itself and the memory fully busy.

Unconstrained System Power
• In theory, the system can draw 83 W.

What is This Paper About?
• Power shifting between the processor and memory.
• The first paper to propose the concept of "power shifting".
• A power-estimation model based on activity counts.
• Three policies for power control at the server level:
−PLI (Proportional-Last-Interval), sliding window, and on-demand.

Processor Power Model
• Power vs. dispatched instructions per cycle (DPC).
• 100K-cycle intervals; 28 applications.
• Linear regression fit: P_cpu = C1 · DPC + C2 (a sketch of the fit follows).
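
A sketch of how such a model could be fitted, with made-up sample points standing in for the paper's 28-application measurements:

    import numpy as np

    # Hypothetical (DPC, power) samples over 100K-cycle intervals.
    dpc = np.array([0.2, 0.5, 0.8, 1.1, 1.4])
    power = np.array([12.1, 15.8, 19.6, 23.2, 27.0])

    # Least-squares fit of P_cpu = C1 * DPC + C2.
    C1, C2 = np.polyfit(dpc, power, 1)
    print(f"P_cpu = {C1:.2f} * DPC + {C2:.2f} W")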

Memory Power Model
• Power vs. memory bandwidth (BW).
• 100K-cycle intervals; 28 applications.
• Linear regression fit (translated into code below):
P_mem = #ranks · #devices · V_DRAM · ((I_active − I_idle) · BW / BW_peak + I_idle) + P_others
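
The formula translates directly into code. All constants below are placeholders; the rank/device counts, currents, and voltage are not the paper's values:

    def memory_power(bw, peak_bw, n_ranks=4, n_devices=16, v_dram=1.8,
                     i_active=0.50, i_idle=0.10, p_others=1.0):
        """Bandwidth-proportional DRAM power: interpolate the per-device
        current between idle and fully active, then add everything that
        does not scale with bandwidth (P_others)."""
        return (n_ranks * n_devices * v_dram *
                ((i_active - i_idle) * bw / peak_bw + i_idle) + p_others)

    # Example: at half the peak bandwidth the devices sit halfway
    # between their idle and active currents.
    print(memory_power(bw=1.6e9, peak_bw=3.2e9))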

Power Actuators
• Power consumption correlates strongly with activity.
• Activity-regulation techniques:
−Instruction-decode throttling.
−Clock throttling: effective duty cycles.
−DVFS.
• For the processor: throttle at the instruction-dispatch unit of the pipeline.
• For memory: limit the total number of memory requests.

System Architecture
[Diagram: a system power controller reads a dispatched-instructions counter on the processor core and a request counter on the memory controller, and enforces its decisions via fetch throttling and request throttling. The core uses extensive clock gating; the memory goes into powerdown mode when idle.]

Key Ideas
• Assume the activity count in the next interval equals that of the last interval.
• Estimate power based on this history.
• Allocate power based on the estimates.
• The allocation is enforced through thresholds on the activity counts.
• Power is split into activity-dependent power and standby power.

Proportional-Last-Interval (PLI) Policy
• Power estimation:
−CPU power = C1 · DPC0 + C2
−Memory power = M1 · BW0 + M2
−Power available for allocation: P_dynamic = P_budget − C2 − M2
−DPC1 and BW1 are the activity targets for the next interval.
−Estimated activity power: P_est = C1 · DPC0 + M1 · BW0
• Power allocation:
−DPC1 = DPC0 · P_dynamic / P_est
−BW1 = BW0 · P_dynamic / P_est
• Thresholds:
−D_th = DPC1 · Period
−M_th = BW1 · Period
(A sketch of one control step follows.)
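
Putting the slide's formulas together, a minimal sketch of one PLI control step; the coefficients come from the fitted models, and the function name is mine:

    def pli_step(p_budget, dpc0, bw0, C1, C2, M1, M2, period):
        """One Proportional-Last-Interval decision: assume the next
        interval repeats the last one, then scale both activities so
        their estimated power fits the budget left after standby power."""
        p_dynamic = p_budget - C2 - M2       # budget left for activity power
        p_est = C1 * dpc0 + M1 * bw0         # power if activity repeats
        if p_est <= 0:
            return float("inf"), float("inf")  # idle interval: no throttling
        scale = p_dynamic / p_est            # may exceed 1, leaving headroom
        d_th = dpc0 * scale * period         # instructions allowed next interval
        m_th = bw0 * scale * period          # memory requests allowed
        return d_th, m_th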

Other Related Policies
• Sliding window (see the sketch below):
−Shorter intervals give better estimation accuracy.
−Longer intervals reduce noise.
−A window covers the last 20 intervals.
• On-demand:
−No violation, no throttling.
−The interval must be small enough.
• Run-To-Exhaustion (RTE):
−Power is monitored cycle by cycle.
−Throttle only when the budget is violated.
−Impractical, but provides a comparison point.
• Static: budgets are assigned in proportion to peak power.
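
A sketch of the sliding-window estimator, assuming the 20-interval window from the slide; the class and method names are mine:

    from collections import deque

    WINDOW = 20  # intervals per window

    class SlidingWindowEstimator:
        """Average activity over the last WINDOW intervals: the short
        intervals track phase changes while the window smooths noise."""
        def __init__(self):
            self.dpc = deque(maxlen=WINDOW)
            self.bw = deque(maxlen=WINDOW)

        def observe(self, dpc, bw):
            self.dpc.append(dpc)
            self.bw.append(bw)

        def estimate(self):
            if not self.dpc:
                return 0.0, 0.0
            return (sum(self.dpc) / len(self.dpc),
                    sum(self.bw) / len(self.bw))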

Simulation Environment
• Traces from hardware (SPEC) and Mambo (e.g., JBB).
• Integrated simulation environment:
−Turandot + PowerTimer (Zhigang Hu et al., IBM T.J. Watson): timing and power model of the core; a 2 GHz, 970-like core with aggressive clock gating; 512 KB L2 cache (power not simulated).
−MEMSIM: timing and power model of the DRAM; 4 GB, 4 ranks of 128-bit 400 MHz DDR (PC3200).
−The two simulators are synchronized every cycle.

PLI vs. Static (1)
• Budget: 40 W.
• PLI does much better than static budgeting.

PLI vs. Static (2)
• Budget: 50 W.
• The average unconstrained power consumption is below the budget.

Policy Comparison
• 100K-cycle interval, 40 W budget.
• On-demand is generally the best:
−At the cost of budget violations lasting at least one interval.

Policy Comparison (Normalized to RTE)
• On-demand and the sliding window are generally the best.
• On-demand is even better than RTE for art:
−It does not proactively throttle activity, at the cost of short budget violations.

Interval Size (PLI)
• ammp: activity is highly variable even at small intervals, while power remains steady.
• Smaller intervals fit variations better.
• In general, the best interval size is highly application-dependent.

Critiques
• Paper 1:
−Lacks explanations for the behavior of some workloads (e.g., bzip).
−The examples used to illustrate the policies are not typical.
−Does not explain why the performance degradation is surprisingly slight.
• Paper 2:
−An open-loop, estimation-based scheme.
−Limited verification of the power models.
−Power is not so much shifted as throttled.
−Performance degradation is large even when the budget exceeds the average power.

Comparison of the Two Papers
• Target: the peak power of the DRAM system (Limiting Power) vs. the peak power at the server level (Power Shifting).
• Goal: peak-power capping with energy conservation vs. peak-power capping alone.
• Methodology: Knapsack optimization plus heuristics vs. open-loop estimation.
• Common to both: a set of policies with comparisons, simulation-based experiments, and power budgets larger than the average consumption.

Thank you!