Energy Optimization and Stability in Green Data Centers
Tarek Abdelzaher
Dept. of Computer Science, University of Illinois at Urbana-Champaign, USA
On sabbatical at the Department of Automatic Control, Lund University, Sweden

Energy Management in Data Centers
- Total consumption: 2% of energy spent in the US (EPA estimate)
- Energy bill is 20-50% of total profit
- Energy expended on:
  - Computing (powering up racks of machines)
    - Sensors: utilization, delay, throughput, …
    - Actuators: DVS (dynamic voltage scaling), turning machines on/off
  - Cooling
    - Sensors: temperature, air flow, …
    - Actuators: air-conditioning units, fans, …

Current Status
- Increased emphasis on energy control
- More “manipulation knobs” are introduced to manage energy and performance

Challenge
- Knobs may interact in unexpected ways
- Different performance and energy management policies may interfere with one another
- Uncoordinated interference of multiple knobs can lead to instability or poor efficiency

A Tale of Two Policies
[Figure: energy saving for DVS alone, On/Off alone, and DVS + On/Off. Empirical measurements from a 30-machine, 3-tier testbed of a shopping site.]
- DVS + On/Off: more energy consumption than DVS or On/Off alone!

Three Performance Management Challenges
1. Avoid the “avoidable” (bad) interactions
2. Manage the “unavoidable” interactions (so they do not lead to instability)
3. Troubleshoot remaining interaction problems

Response Time Control Problem in VMs
- CPU has been popular for controlling response time
- Goal: dynamically change the CPU shares of VMs to meet the response time (RT) constraint
- With only CPU control, the response time constraint is severely violated. Why?
[Figure: response time panels for VM1 and VM2]

Memory Utilization, Disk I/O, and CPU Consumption
[Figure: # of page faults as a function of memory utilization]
- Page faults drastically increase after a certain threshold
[Figure: CPU as a function of memory utilization]
- Significant CPU overhead after the threshold
- Increase in CPU usage mainly caused by extra paging activities

Response Time and Memory Utilization
- Sharp increase in response time once memory utilization passes a certain threshold, say 90%
- To achieve the desired performance, we need to avoid the “bad” region

CPU and Memory Control
[Diagram: VM 1 (App 1) through VM n (App n) on a VMM with a CPU scheduler and a memory manager; a CPU controller and a memory controller. Signals shown: application SLOs, application-level performance, resource usage, CPU allocation, memory allocation.]
- CPU controller for controlling response time
- Memory controller makes sure the memory utilization doesn’t go over 90%
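The division of labor between the two controllers can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the controllers from the talk: the 5% headroom, the gain of 0.2, the multiplicative CPU update rule, and all names (`VMState`, `memory_controller`, `cpu_controller`, `control_step`) are hypothetical. Only the 90% memory utilization limit comes from the slides.

```python
from dataclasses import dataclass

MEM_UTIL_LIMIT = 0.90  # "bad region" boundary quoted on the slides


@dataclass
class VMState:
    cpu_share: float      # fraction of CPU allocated to the VM, in (0, 1]
    mem_alloc_mb: float   # memory currently allocated to the VM
    mem_used_mb: float    # memory the application actually uses
    response_time: float  # measured response time in seconds


def memory_controller(vm: VMState, headroom: float = 0.05) -> float:
    """Return an allocation that keeps utilization just under the limit.

    The headroom value is an assumption made for this sketch.
    """
    target_util = MEM_UTIL_LIMIT - headroom
    return vm.mem_used_mb / target_util


def cpu_controller(vm: VMState, rt_target: float, gain: float = 0.2) -> float:
    """Incremental update: raise the CPU share when response time is too high.

    The gain and the clamping bounds are assumptions made for this sketch.
    """
    relative_error = (vm.response_time - rt_target) / rt_target
    new_share = vm.cpu_share * (1.0 + gain * relative_error)
    return min(1.0, max(0.05, new_share))


def control_step(vm: VMState, rt_target: float) -> VMState:
    """Run both controllers once and return the new allocations."""
    return VMState(
        cpu_share=cpu_controller(vm, rt_target),
        mem_alloc_mb=memory_controller(vm),
        mem_used_mb=vm.mem_used_mb,
        response_time=vm.response_time,
    )


if __name__ == "__main__":
    vm = VMState(cpu_share=0.5, mem_alloc_mb=1000.0,
                 mem_used_mb=980.0, response_time=2.0)
    print(control_step(vm, rt_target=1.0))
```

In this example the VM starts at 98% memory utilization, inside the bad region. One control step grows the allocation so that utilization returns to about 85%, and the CPU share rises because the response time is twice the target.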

Performance of Joint Controllers with Synthetic Workload (cont.)
[Figure: panels for VM1 and VM2]
- Without dynamic memory control, VMs cannot get enough memory when memory gets scarce
- The joint controller gives just enough memory not to fall into the bad region, so physical memory is utilized efficiently

DVS and On/Off Interactions in Energy Minimization
[Figure: curves for DVS + On/Off, DVS alone, and On/Off alone]
- The DVS and On/Off “knobs” must be controlled holistically, in a coordinated manner, as the solution to an optimization problem
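Treating both knobs as decision variables of one optimization problem can be illustrated with a toy model. The sketch below is not the formulation used in the talk. It assumes a per-machine power of `p_idle + c * f**3` at normalized frequency `f`, an M/M/1 delay model with load split evenly across machines, and made-up constants; all function names are hypothetical. It shows only that a joint search over (machines, frequency) never does worse than tuning one knob with the other fixed. It does not reproduce the measured case where uncoordinated DVS + On/Off performs worse than either alone.

```python
from itertools import product


def total_power(m, f, p_idle=100.0, c=150.0):
    """Cluster power in watts for m machines at normalized frequency f (assumed model)."""
    return m * (p_idle + c * f ** 3)


def mean_delay(m, f, lam, mu=100.0):
    """M/M/1 mean response time per machine; infinite if the machine is overloaded."""
    capacity = mu * f
    per_machine_load = lam / m
    if per_machine_load >= capacity:
        return float("inf")
    return 1.0 / (capacity - per_machine_load)


def cheapest(lam, max_delay, machines, freqs):
    """Return the (power, machines, frequency) tuple with minimum power, or None."""
    feasible = [
        (total_power(m, f), m, f)
        for m, f in product(machines, freqs)
        if mean_delay(m, f, lam) <= max_delay
    ]
    return min(feasible) if feasible else None


if __name__ == "__main__":
    lam, max_delay = 1000.0, 0.05          # requests/s and delay bound (assumed)
    all_machines = range(1, 31)            # up to 30 machines
    all_freqs = (0.4, 0.6, 0.8, 1.0)       # available DVS levels (assumed)

    on_off_only = cheapest(lam, max_delay, all_machines, (1.0,))
    dvs_only = cheapest(lam, max_delay, (30,), all_freqs)
    joint = cheapest(lam, max_delay, all_machines, all_freqs)
    print("On/Off alone:", on_off_only)
    print("DVS alone:   ", dvs_only)
    print("Joint:       ", joint)
```

With these assumed numbers, the joint search settles on an intermediate frequency and an intermediate machine count, a point that neither single-knob policy can reach.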

Results
[Figure: curves for DVS + On/Off, DVS alone, On/Off alone, and Optimal]

Measurements from a Machine Room
[Figure: energy saving by configuration. Groups: fixed cooling set point, fixed number of machines, holistic optimization. Bars: Even, Bottom-Up, Bottom-Up + Off, Optimal.]

Help the Admin: Administrative Cost is Skyrocketing!

Diagnostics
- In mechanical systems, components are connected and correlated
- In software systems, key variables in adaptive actions are correlated
- Monitor changes in correlations to diagnose performance problems
- If correlations are broken, the system may not perform as expected

Diagnostics
[Diagram: a regulation policy sets performance knobs (actuators) on the target system, and sensors report back. An automated-detection component monitors the target system (system workload, system output, control knob settings, target performance reference), translates the policy into causality assumptions in the form of an adaptation graph, detects assumption violations, and has a backup policy available. Example adaptation graph with nodes DR, U, and AC and signed (+) edges, shown as learned and as estimated.]
1. Learning phase: learn the adaptation graph by calculating correlation coefficients
2. At run time: periodically recalculate the sign of the edges in the adaptation graph
3. Check the sign

Diagnostics
[Diagram: same detection architecture as on the previous slide, with the regulation policy, automated detection, adaptation graph, and backup policy]
- Stop the component causing the sign problem
- Execute backup action: open-loop action
- Try several times
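The learn-then-check procedure on the two Diagnostics slides can be sketched as follows. This is a minimal illustration, not the tool from the talk: the Pearson coefficient, the dead band `eps`, the rule that any sign different from the learned one counts as a violation, and the names `learn_graph` and `violated_edges` are all assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def sign(value, eps=0.1):
    """Map a coefficient to +1, -1, or 0 (inside an assumed dead band)."""
    if value > eps:
        return 1
    if value < -eps:
        return -1
    return 0


def learn_graph(samples, edges):
    """Learning phase: record the correlation sign of every edge."""
    return {(a, b): sign(pearson(samples[a], samples[b])) for a, b in edges}


def violated_edges(learned, window):
    """Run time: recompute signs on a recent window and report changed edges."""
    return [
        edge
        for edge, expected in learned.items()
        if expected != 0
        and sign(pearson(window[edge[0]], window[edge[1]])) != expected
    ]


if __name__ == "__main__":
    training = {"req": [10, 20, 30, 40, 50], "util": [0.2, 0.4, 0.5, 0.7, 0.9]}
    graph = learn_graph(training, [("req", "util")])

    # Overload window: requests keep rising while utilization falls.
    overload = {"req": [60, 70, 80, 90, 100], "util": [0.9, 0.6, 0.4, 0.3, 0.2]}
    print(violated_edges(graph, overload))
```

A reported edge would be the trigger for stopping the responsible component and falling back to the open-loop backup action.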

Example
Unintended interaction between a utilization controller in a Web server and the kernel anti-livelock mechanism:
- Admission control is based on utilization
- It drops lower-priority requests first

What happens:
- Increased workload → switch from interrupt handling to polling → utilization drops
- Controller tries to accept more requests → aggravates the situation
- Most new requests are dropped by the kernel, and no prioritization is enforced
[Diagram: adaptation graph with nodes Req, Util, Pd, and AC and signed (+) edges, shown before and after the violation]

Diagnostics Example
[Figure: CPU utilization and # of network interrupts over time, in two phases: “1. Closed loop” and “2. Closed loop - violation”]
1. Network processing is overloaded: switching from interrupt handling to polling
2. Utilization sharply drops due to the decrease in the number of interrupts
3. Admission control policy tries to accept more requests, aggravating the situation
- The correlation Req → Util becomes broken

More on Diagnostics
- Correlations between continuous variables do not uncover problems due to sequences of discrete events
- Focus on runtime events related to performance, e.g., turn on machines, decrease DVS, send a packet
- Find a (cyclic) sequence of events that discriminates between “good” and “bad” performance cases
- Data mining technique: discriminative sequence analysis

Main Idea
- Log different events during runtime
  - Most of the time the system works
  - Occasionally it performs poorly
- Generate the frequent sequences of events that occur when the system works correctly
- Generate the frequent sequences of events that occur when the system exhibits undesirable behavior
- Identify the “culprit” sequences of events that are found only in the latter case but not the former
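The steps above can be sketched with a deliberately simplified miner. It is not the discriminative sequence analysis technique from the talk: it only counts contiguous event sequences of a fixed length, and the support threshold, function names, and example logs are assumptions made for illustration (the event names `SleepEvent` and `WakeUpEvent` are borrowed from the case study later in the deck).

```python
from collections import Counter


def frequent_sequences(logs, length, min_support):
    """Contiguous event sequences of the given length that occur in at
    least a min_support fraction of the logs."""
    counts = Counter()
    for log in logs:
        # Count each sequence at most once per log.
        seen = {tuple(log[i:i + length]) for i in range(len(log) - length + 1)}
        counts.update(seen)
    threshold = min_support * len(logs)
    return {seq for seq, count in counts.items() if count >= threshold}


def culprit_sequences(good_logs, bad_logs, length=2, min_support=1.0):
    """Sequences frequent in the bad logs but not frequent in the good logs."""
    bad = frequent_sequences(bad_logs, length, min_support)
    good = frequent_sequences(good_logs, length, min_support)
    return bad - good


if __name__ == "__main__":
    good_logs = [
        ["AssignTask", "RunTask", "TaskDone"],
        ["AssignTask", "RunTask", "TaskDone", "AssignTask", "RunTask"],
    ]
    bad_logs = [
        ["AssignTask", "SleepEvent", "WakeUpEvent", "SleepEvent", "WakeUpEvent"],
        ["SleepEvent", "WakeUpEvent", "SleepEvent", "AssignTask"],
    ]
    for seq in sorted(culprit_sequences(good_logs, bad_logs)):
        print(seq)
```

On these made-up logs the only sequences that survive the subtraction are the two halves of the sleep/wake-up cycle.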

A Case Study on a “Hot” Day: Throughput of a Server Farm
[Figure: throughput of the server farm over time, with a “Low Throughput” region marked]

Three Performance Control Policies
- Thermal Management Policy
  - Puts a machine to sleep if the machine is overheated
- Energy-Aware Load Balancer
  - Distributes load based on average CPU utilization
  - Attempts to minimize the number of machines in use
- Machine On/Off Policy
  - Turns off idle machines to save energy

Regular Operating Condition
- Maximum temperature is well below 60 degrees

Anomalous Condition
- Maximum temperature is above 60 degrees
- Eventually, only the overheated machine remained on!

Diagnostics Output: Reported Culprit Event Sequences
- Cycle: SleepEvent, WakeUpEvent
- Cycle: Temp: , Temp: ,
- Oops: utilization is computed based on a recent time average (including “sleep” time) → artificially low if the machine sleeps

What was going on?
- No matter how many tasks are assigned to the overheated machine, its utilization remains well below the threshold due to periodic sleeping
- The load balancer keeps assigning more and more tasks to the overheated machine
- The On/Off policy keeps turning off other machines
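A small worked example makes the measurement flaw concrete. The numbers, the 0.7 threshold, and the function names below are hypothetical. The only point taken from the slides is that utilization averaged over a window that includes sleep time comes out artificially low.

```python
UTIL_THRESHOLD = 0.7  # hypothetical load-balancer threshold


def windowed_utilization(busy_s, awake_s, asleep_s):
    """Utilization averaged over the whole window, sleep time included."""
    return busy_s / (awake_s + asleep_s)


def awake_utilization(busy_s, awake_s):
    """Utilization measured only over the time the machine was awake."""
    return busy_s / awake_s if awake_s > 0 else 0.0


if __name__ == "__main__":
    # An overheated machine: fully busy whenever awake, but asleep for
    # 70 s of every 100 s window because of the thermal policy.
    busy, awake, asleep = 30.0, 30.0, 70.0

    reported = windowed_utilization(busy, awake, asleep)
    actual = awake_utilization(busy, awake)

    print("reported utilization:", reported)
    print("utilization while awake:", actual)
    print("looks underloaded:", reported < UTIL_THRESHOLD)
```

A machine that is saturated whenever it is awake still reports 30% utilization here, so a balancer comparing that figure against its threshold would keep sending it work.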

Conclusions (the needs)
- Must identify the right knobs to manipulate (e.g., the example with virtual machine memory allocation)
- Must manage them in a jointly optimal manner to avoid instability or poor performance
- Must develop automated self-diagnostic techniques to reduce administrator effort

Conclusions (the tools)
- Control theory of positive systems offers interesting insights into distributing the holistic management of interacting feedback control knobs in data centers
- Advances in event-based control offer opportunities to significantly reduce actuation overhead (e.g., the number of times machines are turned on/off) without degrading performance
- Advances in discriminative sequence mining offer opportunities for improving self-diagnostic capabilities in complex systems