Energy Optimization and Stability in Green Data Centers Tarek Abdelzaher Dept. of Computer Science University of Illinois at Urbana Champaign, USA On Sabbatical.



1 Energy Optimization and Stability in Green Data Centers Tarek Abdelzaher Dept. of Computer Science University of Illinois at Urbana Champaign, USA On Sabbatical at the Department of Automatic Control Lund University, Sweden

2 Energy Management in Data Centers Total consumption: 2% of energy spent in US (EPA estimate) Energy bill is 20-50% of total profit Energy expended on: Computing (powering up racks of machines) Sensors: Utilization, Delay, Throughput, … Actuators: DVS, turning machines On/Off Cooling Sensors: Temperature, air flow, … Actuators: Air-conditioning units, fans, …

3 Current Status Increased emphasis on energy control More “manipulation knobs” are introduced to manage energy and performance Challenge Knobs may interact in unexpected ways Different performance and energy management policies may interfere with one another Uncoordinated interference of multiple knobs can lead to instability or poor efficiency

4 A Tale of Two Policies DVS + On/Off: more energy consumption than DVS or On/Off alone! Empirical measurements from a 30-machine 3-tier testbed of a shopping site. [Figure: energy saving for DVS alone, On/Off alone, and DVS + On/Off]

5 Three Performance Management Challenges Avoid the “avoidable” (bad) interactions Manage the “unavoidable” interactions (so they do not lead to instability) Troubleshoot remaining interaction problems

6 Three Performance Management Challenges Avoid the “avoidable” (bad) interactions Manage the “unavoidable” interactions (so they do not lead to instability) Troubleshoot remaining interaction problems

7 Response Time Control Problem in VMs CPU has been a popular knob for controlling response time. Goal: dynamically change CPU shares of VMs to meet the RT constraint. With only CPU control, the response time constraint is severely violated. Why? [Figure panels: VM1, VM2]

8 Memory Utilization, Disk I/O, and CPU Consumption Page faults drastically increase after a certain threshold. Significant CPU overhead after the threshold; the increase in CPU usage is mainly caused by extra paging activities. [Plots: # of page faults as a function of memory utilization; CPU as a function of memory utilization]

9 Response Time and Memory Utilization Sharp increase in response time after a certain threshold, say 90% To achieve the desired performance, we need to avoid the “bad” region

10 CPU and Memory Control CPU controller for controlling response time. Memory controller makes sure the memory utilization doesn’t go over 90%. [Diagram labels: VM 1 (App 1) ... VM n (App n); VMM with CPU Scheduler and Memory Manager; CPU Controller and Memory Controller; inputs: Application SLOs, application-level performance, resource usage; outputs: CPU allocation, memory allocation; signals S_r, S_p, S_n]
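The two control loops on slide 10 can be sketched roughly as follows. The slide does not give the control laws, so everything here is an assumption made for illustration: the function names, the proportional update with gain `gain`, the share limits, and the choice to size memory as `used / target_util` are all hypothetical.

```python
def memory_allocation(used_mb, target_util=0.9):
    """Smallest allocation (MB) that keeps used/allocated at or below target_util.

    Assumption: the 90% bound from the slide is enforced by sizing the
    allocation directly from current usage.
    """
    return used_mb / target_util


def cpu_share_step(share, measured_rt, target_rt, gain=0.1, lo=0.05, hi=1.0):
    """One feedback step for the CPU share of a VM (hypothetical control law).

    The share grows when the measured response time exceeds the target and
    shrinks when it is below the target, clamped to [lo, hi].
    """
    error = (measured_rt - target_rt) / target_rt  # normalized RT error
    return min(hi, max(lo, share + gain * error))


# Hypothetical readings for one VM
new_alloc = memory_allocation(used_mb=900)  # about 1000 MB keeps utilization at 90%
new_share = cpu_share_step(share=0.5, measured_rt=2.0, target_rt=1.0)  # share rises
```

Running both steps each control period keeps the VM out of the high-paging region while the CPU loop tracks the response-time target; the gains and limits would need tuning on a real system.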

11 Performance of Joint Controllers with Synthetic Workload (cont.) Without dynamic memory control, VMs cannot get enough memory when memory gets scarce. The joint controller gives just enough memory not to fall into the bad region and utilizes physical memory efficiently. [Figure panels: VM1, VM2]

12 Three Performance Management Challenges Avoid the “avoidable” (bad) interactions Manage the “unavoidable” interactions (so they do not lead to instability) Troubleshoot remaining interaction problems

13 DVS and On/Off Interactions in Energy Minimization [Figure legend: DVS + On/Off, DVS alone, On/Off alone]

14 DVS and On/Off Interactions in Energy Minimization The DVS and On/Off “knobs” must be controlled holistically, in a coordinated manner, as the solution to an optimization problem. [Figure legend: DVS + On/Off, DVS alone, On/Off alone]
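To make the idea on slide 14 concrete, here is a minimal sketch that picks the number of active machines and a common frequency together by exhaustive search. It is not the optimization used in the talk. The power model (idle power plus a cubic term in frequency), the constants `N`, `P_IDLE`, `C`, the frequency levels in `FREQS`, and the capacity model `m * f` are all assumptions chosen for illustration, and the load is assumed not to exceed `N`.

```python
FREQS = [0.4, 0.6, 0.8, 1.0]  # hypothetical normalized DVS frequency levels
N = 10                        # hypothetical number of machines
P_IDLE = 0.5                  # hypothetical idle power per active machine
C = 1.0                       # hypothetical coefficient of the cubic term


def power(m, f):
    """Total power of m active machines running at frequency f (assumed model)."""
    return m * (P_IDLE + C * f ** 3)


def feasible(m, f, load):
    """Capacity is assumed to be m * f; a small tolerance absorbs float error."""
    return m * f >= load - 1e-9


def dvs_alone(load):
    """All N machines stay on; pick the cheapest feasible frequency."""
    return min(power(N, f) for f in FREQS if feasible(N, f, load))


def onoff_alone(load):
    """Machines run at full frequency; keep the fewest machines on."""
    m = min(m for m in range(1, N + 1) if feasible(m, 1.0, load))
    return power(m, 1.0)


def joint(load):
    """Search both knobs together; returns (power, machines, frequency)."""
    return min((power(m, f), m, f)
               for m in range(1, N + 1)
               for f in FREQS
               if feasible(m, f, load))


best_power, best_m, best_f = joint(4.0)
```

Under these assumed numbers and a load of 4.0, the joint search settles on an intermediate frequency with a moderate number of machines and uses less power than either single-knob policy. The search only shows what coordinating the knobs means; it does not reproduce the instability seen when two independent controllers run side by side.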

15 Results [Figure legend: DVS + On/Off, DVS alone, On/Off alone, Optimal]

16 Holistic Optimization: Energy Saving Measurements from a Machine Room [Figure labels: Fixed cooling set point; Fixed number of machines; Even, Bottom-Up, Bottom-Up + Off, Optimal]

17 Three Performance Management Challenges Avoid the “avoidable” (bad) interactions Manage the “unavoidable” interactions (so they do not lead to instability) Troubleshoot remaining interaction problems

18 Help the Admin: Administrative Cost is Skyrocketing!

19 Diagnostics In mechanical systems, components are connected and correlated. In software systems, key variables in adaptive actions are correlated. Monitor changes in correlations to diagnose performance problems. If correlations are broken, the system may not perform as expected.

20 Diagnostics 1. Learning phase: learn the adaptation graph by calculating correlation coefficients. 2. At run-time: periodically recalculate the sign of the edges in the adaptation graph. 3. Check the sign. [Diagram labels: Automated detection: monitor the target system (system workload, system output, control knob settings), translate into causality assumptions, detect assumption violation; Target performance reference; Regulation Policy; Backup Policy; Performance Knobs (Actuators); Sensors; Target System; Adaptation Graph, learned: DR U + + AC, estimated: DR U + AC]
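The three steps on slide 20 can be sketched as follows. This is a minimal illustration and not the authors' tool: Pearson correlation is assumed as the correlation coefficient, and the function names, the variable names `req` and `util`, and the sample values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def sign(value):
    """Return -1, 0, or +1."""
    return (value > 0) - (value < 0)


def learn_signs(samples, edges):
    """Step 1 (learning phase): record the correlation sign of every edge."""
    return {(a, b): sign(pearson(samples[a], samples[b])) for a, b in edges}


def violated_edges(samples, expected):
    """Steps 2 and 3 (run time): recompute each sign and report edges that changed."""
    return [edge for edge, s in expected.items()
            if sign(pearson(samples[edge[0]], samples[edge[1]])) != s]


# Hypothetical training window: admitting more requests raises utilization.
training = {"req": [10, 20, 30, 40], "util": [0.2, 0.4, 0.6, 0.8]}
expected = learn_signs(training, [("req", "util")])

# Hypothetical run-time window: utilization falls while admitted requests rise.
runtime = {"req": [40, 50, 60, 70], "util": [0.8, 0.6, 0.4, 0.3]}
broken = violated_edges(runtime, expected)
```

An edge that shows up in `broken` marks a control loop whose assumed cause-and-effect direction no longer holds, which is the cue to stop that loop and fall back to a backup action.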

21 Diagnostics Stop the component causing the sign problem. Execute backup action: open-loop action. Try several times. [Diagram labels: Automated detection: monitor the target system (system workload, system output, control knob settings), translate into causality assumptions, detect assumption violation; Target performance reference; Regulation Policy; Backup Policy; Performance Knobs (Actuators); Sensors; Target System; Adaptation Graph: DR U + + AC]

22 Example Increased workload → interrupt handling switches to polling → utilization drops. Controller tries to accept more requests → aggravates the situation → most new requests dropped by kernel. No prioritization enforced. Unintended interaction between a utilization controller in a Web server and the kernel anti-livelock mechanism: admission control is based on utilization; it drops lower-priority requests first. [Adaptation graphs: UtilPd Req + + AC; UtilPd Req + AC]

23 Diagnostics Example 1. Network processing is overloaded: switching from interrupt handling to polling. 2. Utilization sharply drops due to the decrease in the number of interrupts. 3. Admission control policy tries to accept more requests, aggravating the situation. Correlation Req → Util becomes broken. [Plot panels: 1. Closed loop; 2. Closed loop - violation. Series: CPU utilization, # of network interrupts]

24 More on Diagnostics Correlations between continuous variables do not uncover problems due to sequences of discrete events. Focus on runtime events related to performance, e.g., turn on machines, decrease DVS, send a packet, etc. Find a (cyclic) sequence of events that discriminates “good” and “bad” performance cases. Data mining technique: discriminative sequence analysis

25 Main Idea Log different events during runtime. Most of the time the system works; occasionally it performs poorly. Generate the frequent sequences of events that occur when the system works correctly. Generate the frequent sequences of events that occur when the system exhibits undesirable behavior. Identify the “culprit” sequences of events that are found only in the latter case but not the former.
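The idea on slide 25 can be illustrated with a deliberately simplified sketch. It treats a "sequence" as a contiguous run of `n` events and measures support as the fraction of logs containing that run. Real discriminative sequence mining is more general, so this is a stand-in rather than the technique used in the talk. The function names, event names, and logs are hypothetical.

```python
from collections import Counter


def ngrams(log, n):
    """Set of contiguous length-n event sequences that occur in one log."""
    return {tuple(log[i:i + n]) for i in range(len(log) - n + 1)}


def frequent(logs, n, min_support):
    """Sequences that appear in at least min_support (a fraction) of the logs."""
    counts = Counter()
    for log in logs:
        counts.update(ngrams(log, n))  # each sequence counted once per log
    return {seq for seq, c in counts.items() if c / len(logs) >= min_support}


def culprits(good_logs, bad_logs, n=2, min_support=0.5):
    """Sequences frequent when the system misbehaves but not when it works."""
    return frequent(bad_logs, n, min_support) - frequent(good_logs, n, min_support)


# Hypothetical event logs
good_logs = [["Assign", "Run", "Done"],
             ["Assign", "Run", "Done", "Assign"]]
bad_logs = [["Assign", "Sleep", "WakeUp", "Sleep"],
            ["Sleep", "WakeUp", "Assign", "Sleep"]]

found = sorted(culprits(good_logs, bad_logs, n=2, min_support=1.0))
```

With these logs, `found` holds the two pairs that occur in every bad log and in no good log. The same set difference applies to longer sequences by raising `n`.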

26 A Case Study on a “Hot” Day: Throughput of a Server Farm [Figure annotation: Low Throughput]

27 Three Performance Control Policies Thermal Management Policy Puts machine to sleep if machine is overheated Energy Aware Load Balancer Distributes load based on average CPU utilization Attempts to minimize the number of machines in use Machine On/Off Policy Turns off idle machines to save energy

28 Regular Operating Condition Maximum temperature is well below 60 degrees

29 Anomalous Condition Maximum temperature is above 60 degrees

30 Eventually, only the overheated machine remained on! Anomalous Condition Maximum temperature is above 60 degrees

31 Diagnostics Output: Reported Culprit Event Sequences Cycle: SleepEvent, WakeUpEvent Cycle: Temp: 65 - 70, Temp: 60 - 65,

32 Diagnostics Output: Reported Culprit Event Sequences Cycle: SleepEvent, WakeUpEvent Cycle: Temp: 65 - 70, Temp: 60 - 65, Oops: Utilization is computed based on a recent time average (including “sleep” time) → artificially low if machine sleeps

33 What was going on? No matter how many tasks are assigned to the overheated machine, its utilization remains well below the threshold due to periodic sleeping. Load balancer keeps assigning more and more tasks to the overheated machine. On/Off policy keeps turning off other machines.
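A short calculation shows why the averaged utilization misleads the load balancer. The numbers, the threshold, and the function names below are hypothetical; the only point taken from the slides is that sleep time is included in the averaging window.

```python
THRESHOLD = 0.8  # hypothetical utilization threshold used by the load balancer


def windowed_utilization(busy_s, window_s):
    """Utilization averaged over the whole window, sleep time included."""
    return busy_s / window_s


def awake_utilization(busy_s, sleep_s, window_s):
    """Utilization over the time the machine was actually awake."""
    awake_s = window_s - sleep_s
    return busy_s / awake_s if awake_s > 0 else 0.0


# Hypothetical overheated machine: asleep 70 s of a 100 s window,
# fully busy for the 30 s it was awake.
reported = windowed_utilization(busy_s=30, window_s=100)
actual = awake_utilization(busy_s=30, sleep_s=70, window_s=100)

looks_underloaded = reported < THRESHOLD   # what the load balancer sees
is_saturated = actual >= THRESHOLD         # what is really happening
```

The machine is saturated whenever it is awake, yet the reported figure stays far below the threshold, so it keeps receiving work while idle-looking peers are switched off.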

34 Conclusions (the needs) Must identify the right knobs to manipulate (e.g., the virtual machine memory allocation example). Must manage them in a jointly optimal manner to avoid instability or poor performance. Must develop automated self-diagnostic techniques to reduce administrator effort.

35 Conclusions (the tools) Control theory of positive systems offers interesting insights into distributing the holistic management of interacting feedback control knobs in data centers. Advances in event-based control offer opportunities to significantly reduce actuation overhead (e.g., the number of times machines are turned on/off) without degrading performance. Advances in discriminative sequence mining offer opportunities for improving self-diagnostic capabilities in complex systems.

