Presentation is loading. Please wait.

Presentation is loading. Please wait.

Yang Li, Charles R. Lefurgy, Karthick Rajamani, Malcolm S

Similar presentations


Presentation on theme: "Yang Li, Charles R. Lefurgy, Karthick Rajamani, Malcolm S"— Presentation transcript:

1 A Scalable Priority-Aware Approach to Managing Data Center Server Power
Yang Li, Charles R. Lefurgy, Karthick Rajamani, Malcolm S. Allen-Ware, Guillermo J. Silva, Daniel D. Heimsoth, Saugata Ghose, Onur Mutlu Thanks for the introduction! Hello everyone! Today I would like to talk about the data center power management system that we built at IBM. I will highlight the problems and challenges that we face, and how we solve them. I participated in this project as an intern at IBM under the leadership of my mentors Dr. Charles Lefurgy and Dr. Karthick Rajamani.

2 Importance of Data Center Power Infrastructure
Critical impact on availability: Significant capital cost: Numerous service down due to power outage By Andy Patrizio, Network World, 2018 Tens of millions of US dollars Power infrastructure is an important component of data centers both in terms of its critical impact on availability and its cost. On the side of availability, we have seen power outages cause numerous service down throughout the world. Here I just list an article from Network world saying that based on a survey from Uptime Institute, power outages are increasing even in last year. On the side of cost, …

3 Underutilized Data Center Power Infrastructure
Considering peak demand, not typical demand for capacity sizing Employing redundant power infrastructures Power capacity is sized for 2X peak demand! Reproduced from Barroso et al., The Data Center as a Computer, 2013.

4 Boosting Data Center Performance
Why not add more servers to boost data center performance? During normal operation, let these servers utilize the spare capacity During the rare worst cases, let a power management system kick in Throttle the servers with less important workloads (within seconds) Protect the circuit breakers from tripping Problem: Build a data center power management system using real-time control +X% Our solution is a real-time measurement and control of data center power

5 Outline Problem Background Current Practice & Design Challenges
Key Solutions Implementations Evaluations Open Challenges Conclusions

6 Layout for Data Center Power Infrastructure
A-Side Power Feed B-Side Power Feed We need to protect: DC Contractual budget Circuit breakers of UPS, RPP, CDU Utility ATS Generator ATS Utility . . . UPS . . . UPS UPS UPS Circuit Breaker . . . . . . RPP RPP RPP RPP . . . Rack . . . CDU CDU CDU CDU Server Power Supply Server Power Supply . . . ATS: automatic transfer switch UPS: uninterrupted RPP: reactive power panel CDU: cabinet distribution unit Contractual budget: the total power budget for all the equipments that can consume; contract between facility operators and equipment owner Power Supply Power Supply

7 Outline Problem Background Current Practice & Design Challenges
Key Solutions Implementations Evaluations Open Challenges Conclusions

8 Hierarchical Control Framework
Contractual Budget Design Challenge #1: Consider the redundant connections Design Challenge #2: Enforce priority-aware power capping globally Shifting Controller ATS Shifting Controller . . . UPS UPS Shifting Controller Circuit Breaker . . . RPP RPP Shifting Controller . . . CDU CDU Capping Controller Server Power Supply . . . Power Supply

9 Design Challenges #1: Consider Redundant Connections
Servers do not split power equally Need to enforce different power caps for different supplies Server Power Supply A-side Power Feed B-side Power Feed 230 W A-side demand 400 W demand 170 W B-side demand

10 Design Challenges #1: Consider Redundant Connections
Servers do not split power equally Need to enforce different power caps for different supplies Prior works can only enforce a single combined power cap It is desirable to utilize the stranded power Cap Violation Server Power Supply A-side Power Feed B-side Power Feed 230 W A-side demand 170 W A-side cap 170 W A-side power Design a capping controller to enforce caps for each power supply; Design a shifting controller to utilize stranded power 400 W demand 400 W combined cap 300 W cap 100 W stranded power 170 W B-side demand 230 W B-side cap 130 W B-side power

11 Design Challenges #2: Enforce Priority Globally
Global Priority-Aware: Always cap low-priority servers before high-priority servers across the entire data center Prior works can only enforce priority locally within rack Top CB (Limit: 1400W) Left CB (Limit: 750W) Right CB (Limit: 750W) SD (Low Priority) 1240W Budget SC (Low Priority) SB (Low Priority) SA (High Priority) Minimum server power cap 𝑆𝑒𝑟𝑣𝑒𝑟𝑃 𝑐𝑎 𝑝 𝑚𝑖𝑛 =270𝑊 Design a shifting controller to perform global priority-aware power capping Demand 430 W Budget with Local Priority 350 W 270 W 310 W Budget with Global Priority 430 W 270 W Talk through say servers SA, SB, SC, and SD with different priority levels This is the procedures to handle A side. We repeat this for B-side.

12 Outline Problem Background Current Practice & Design Challenges
Key Solutions Implementations Evaluations Open Challenges Conclusions

13 Overview of CapMaestro
Contractual Budget Shifting Controller ATS For A-side Feed For B-side Feed Shifting Controller Shifting Controller . . . UPS UPS . . . UPS Shifting Controller Shifting Controller Circuit Breaker Circuit Breaker Enforce global priority aware power capping . . . . . . RPP RPP RPP Shifting Controller Shifting Controller . . . . . . CDU CDU CDU Capping Controller Server Power Supply Enforce cap for individual supplies . . . Power Supply

14 Capping Controller for Servers with Multiple Power Supplies
Goal: Ensure all individual power supply caps are respected, while allowing as much power consumption as possible Key Idea: Monitor the minimum error between power caps and power consumption Employ a Proportional-Integral (PI) controller 1 2 3 4 First green part: intel, amd, or ibm server Can deal with multiple power supplies, even some supplies fail, it can also deal with

15 Global Priority-Aware Power Capping
Key Idea: Power Metric Summary Summarize metrics for servers by priority under each shifting controller Use metrics to budget from high priority to low priority Power metrics (at each shifting controller): For each priority 𝑗, 𝑃 𝑐𝑎𝑝_𝑚𝑖𝑛 𝑗 : The minimum total power budget 𝑃 𝑑𝑒𝑚𝑎𝑛𝑑 𝑗 : The total power demand 𝑃 𝑟𝑒𝑞𝑢𝑒𝑠𝑡 𝑗 : The requested power budget 𝑃 𝑐𝑜𝑛𝑠𝑡𝑟𝑎𝑖𝑛𝑡 : The maximum power budget for all the servers (regardless of priority) May not be fulfillable Fulfillable power with respect to other priorities We think the reasonable number of priorities is 10. We can support multiple priorities.

16 Global Priority-Aware Power Capping
Minimum server power cap 𝑆𝑒𝑟𝑣𝑒𝑟𝑃 𝑐𝑎 𝑝 𝑚𝑖𝑛 =270𝑊 High Low 𝑃 𝑐𝑎𝑝_𝑚𝑖𝑛 270 810 𝑃 𝑑𝑒𝑚𝑎𝑛𝑑 430 1290 𝑃 𝑟𝑒𝑞𝑢𝑒𝑠𝑡 970 𝑃 𝑐𝑜𝑛𝑠𝑡𝑟𝑎𝑖𝑛𝑡 1400 1240W Budget Maximum server power cap 𝑆𝑒𝑟𝑣𝑒𝑟𝑃 𝑐𝑎 𝑝 𝑚𝑎𝑥 =490𝑊 Top CB (Limit: 1400W) High Low 430 270 High Low 𝑃 𝑐𝑎𝑝_𝑚𝑖𝑛 540 𝑃 𝑑𝑒𝑚𝑎𝑛𝑑 860 𝑃 𝑟𝑒𝑞𝑢𝑒𝑠𝑡 750 𝑃 𝑐𝑜𝑛𝑠𝑡𝑟𝑎𝑖𝑛𝑡 980 High Low 540 High Low 𝑃 𝑐𝑎𝑝_𝑚𝑖𝑛 270 𝑃 𝑑𝑒𝑚𝑎𝑛𝑑 430 𝑃 𝑟𝑒𝑞𝑢𝑒𝑠𝑡 320 𝑃 𝑐𝑜𝑛𝑠𝑡𝑟𝑎𝑖𝑛𝑡 980 Left CB (Limit: 750W) Right CB (Limit: 750W) 430 270 270 270 SA (High Priority) SB (Low Priority) SC (Low Priority) SD (Low Priority) Demand 430 W

17 Global Priority-Aware Power Capping
Rigorous theoretical proof: Servers with high priorities are always throttled after servers with lower priorities, as long as the circuit breaker limits allow See our IBM technical report Good scalability: Linear algorithm complexity at each controller Fixed ratio of overhead to # servers

18 Outline Problem Background Current Practice & Design Challenges
Key Solutions Implementations Evaluations Open Challenges Conclusions

19 Implementations Implemented as a first-of-a-kind cloud service
Controllers are packaged into containers

20 Implementations For A-side Feed For B-side Feed Room-Level Worker
Contractual Budget Shifting Controller ATS For A-side Feed For B-side Feed Shifting Controller Shifting Controller . . . UPS UPS . . . UPS Room-Level Worker Shifting Controller Shifting Controller Circuit Breaker Circuit Breaker . . . . . . RPP RPP RPP Shifting Controller Shifting Controller . . . . . . CDU CDU CDU Rack-Level Worker Capping Controller Server Power Supply . . . Power Supply

21 Implementations Implemented as a first-of-a-kind cloud service
Controllers are packaged into Worker containers Power controller measures and controls power at 8 second intervals Employ Intel  Node Manager as the underlining server throttling engine Cap power within 1 second Measure power supplies every 1 second Our power control framework supports dynamic server priorities intension

22 Outline Problem Background Current Practice & Design Challenges
Key Solutions Implementations Evaluations Open Challenges Conclusions

23 Evaluations Performed several real-system experiments demonstrating that: Our capping controllers successfully enforce power caps for individual supplies Our shifting controllers ensure servers with high priority are throttled after servers with lower priorities Our shifting controllers reallocate stranded power to the underpowered servers Performed a large-scale data center simulation demonstrating that: Compared to the case of no power management system, our system enables the data center to deploy 50% more servers

24 Key Results Simulate a data center of 162 racks; 30% of servers are assigned high priority Cap Ratio for high-priority servers under a power emergency: Using 1% cap ratio as threshold, our system supports 5832 servers, while Local Priority only supports 4860 servers Throttled performance  Full performance  6-45 servers per rack 3888 servers 4860 servers 5832 servers

25 Outline Problem Background Current Practice & Design Challenges
Key Solutions Implementations Evaluations Open Challenges Conclusions

26 Open Challenges Broadening the target of power management:
Comprehensive power capping scheme for more system components, such as GPU, FPGA, storage, and networking Coordination of job scheduling with power management: Controlling server power by controlling the number of jobs scheduled Fine-grained job-level power control for jobs collocated on the same server Crossing provider-user boundaries for energy savings: Cloud providers need to make the benefits of energy savings visible to users Need to overcome the issue of per-user power metering The ideas is, here are the open challenges for the community, so that we can better deploy power management for data centers.

27 Outline Problem Background Current Practice & Design Challenges
Key Solutions Implementations Evaluations Open Challenges Conclusions

28 Conclusions Data center power capacity is heavily underutilized
Utilize this spare capacity by adding additional servers Employ a power management system to deal with power emergencies or faults Deal with power feeds Enforce priority-aware power capping globally across the entire data center Utilize the stranded power Our power management system boosts data center server capacity by 50% Highlight other open challenges in data center power management

29 Thank you!

30 SDP Architecture SDP Manager SDP DB SDP Worker Fleet worker worker
Inventory and power topology Assign tasks SDP Manager SDP DB Tasks, budget SDP Worker Sensor data Fleet worker worker Budget request & consumption Assignments, budgets & consumption SDP Monitor Diagnostics Service status Servers Remote Power Panel Alerts Eventing Genesis service Metrics Datacenter Hardware PDU SDP container

31 Utilize Stranded Power
After budgeting via the global priority-aware power capping, Do not assign the budget immediately Instead, estimate stranded power Adjust the requested power and rebudget  Stranded power is shifted to other servers 170 W cap 170 W estimated power Server Power Supply A-side Power Feed B-side Power Feed The outcome is stranded power is naturally shiftted 230 W cap 130 W estimated Power = 100 W stranded power


Download ppt "Yang Li, Charles R. Lefurgy, Karthick Rajamani, Malcolm S"

Similar presentations


Ads by Google