Optimal Power Allocation in Server Farms


1 Optimal Power Allocation in Server Farms
Anshul Gandhi (Carnegie Mellon Univ.), Mor Harchol-Balter (Carnegie Mellon Univ.), Rajarshi Das (IBM T.J. Watson), Charles Lefurgy (IBM Austin)

2 U.S. Data Center Energy Consumption
[Chart: U.S. data center energy consumption per year — roughly 12 billion kWh in 2000, 50 billion kWh in 2006, and a projected 120 billion kWh ($7.4 billion) in 2011.]
Let's start by looking at why power is important. The chart illustrates the total annual energy consumption of U.S. data centers. As we can see, energy consumption increased by more than a factor of 4 between 2000 and 2006, and is projected to grow to almost 10 times the 2000 level by 2011. In terms of money, that would amount to almost $7.4 billion per year. Now that's a lot of money!
Source: EPA Report to Congress on Server and Data Center Energy Efficiency, 2007

3 Goal
Get the best performance from the power, P, that we have.
Thus, we focus our attention on data centers. A data center is made up of many racks of servers. Each rack looks something like this; we shall refer to a rack of servers as a server farm. Each server farm is limited to a fixed power consumption of P. Our goal is to get the best performance out of the rack, given the fixed power limit P.

4 Goal: How to split P to minimize mean response time?
The right answer can improve performance by up to 5X. Constraint: P ≥ P1 + P2 + P3.
In this talk, we will use mean response time as our performance metric. The response time of a job is the time from when the job enters the system until it departs. A natural question is how to split the power P among the servers in the server farm into, say, P1, P2 and P3. Our constraint is that P must be greater than or equal to the sum of P1, P2 and P3. As we will show, the right split can improve mean response time by up to a factor of 5. Note that we assume a fixed power limit P and want to minimize mean response time, so we don't care whether the total power consumption equals P or is less than P. When a data center is set up, it is typically provisioned for a maximum power consumption: the cooling requirements, the circuit breakers and the alternate power supplies are all built with this power limit in mind. So in this talk we simply assume a fixed power limit P.

5 Power Efficient Load Balancer
[Diagram: power P and the workload enter a power-efficient load balancer, which outputs a power split P1, P2, P3 and a load split q1, q2, q3 across the servers. Inputs: total power P, speed scaling technology (power-to-frequency curves), workload, arrival rate, open vs. closed configuration, max and min server speed.]
So here is our server farm. What we will do is create a power-efficient load balancer that will magically output the power distribution P1, P2 and P3, and also the load distribution q1, q2 and q3. Here q1 represents the fraction of incoming load sent to server 1; similarly q2 and q3 for servers 2 and 3. As its input, the power-efficient load balancer takes a long list of factors. First, the total power P. Second, the speed scaling technology in each server: by speed scaling, we mean the mechanism by which a server can reduce its power consumption by running at a lower clock frequency. Depending on the technology used, the power-to-frequency mapping for a server can vary. In these graphs, we have power consumed on the x-axis and server frequency, meaning the speed of the server, on the y-axis. We will assume a homogeneous server farm, so all servers have the same speed scaling technology. The workload running at the server farm also affects how you should split power among the servers, because the workload can change the power-to-frequency relationship even for a given speed scaling technology. There are other important factors, too: the arrival rate of the workload, and whether we have an open-loop or a closed-loop workload configuration. By open loop, we mean a server farm in which arrivals are external to the system, similar to a web server. By closed loop, we mean a server farm where the number of jobs in the farm is fixed. There are various other factors as well, such as the maximum and minimum server speeds. We take all these factors into account when building our power-efficient load balancer.

6 Outline
Experimental Setup
Power → Speed: how power affects server speed for a single server
Speed → Response time: how response time of the server farm depends on individual server speeds
Optimal power allocation: theorems and experiments
Here's the outline for the rest of the talk. We'll start by describing our experimental setup. Then we'll move on to understanding how power affects server speed for a single server. Obviously, as we allocate more power to a server, it will run faster; but exactly how fast depends on many factors. The power-to-speed relationship depends on the scaling technology and has many parameters. Also, when we say power, we mean system power, not just processor power, so it is not obvious how speed scaling affects the total power of the system. This understanding of the power-to-speed relationship is an important contribution of our work. Then we'll look at how performance depends on the individual server speeds of all the servers in the server farm. Finally, we'll look at both theoretical and experimental results for optimal power allocation in server farms. Here, we'll show how to find the optimal power allocation for your server farm, as a function of all the inputs mentioned before. So let's get started with our experimental setup.

7 Experimental Setup
Blade: Intel Xeon 5000 series, 3 GHz, quad core, 4 GB RAM
Scaling tech.: DFS, DVFS, DVFS+DFS
Workload: CPU bound (LINPACK, DAXPY); memory bound (STREAM); other (WebBench, GZIP, BZIP2)
Rack: IBM BladeCenter HS21 with 7 blade servers
As before, here's our server farm setting. The rack of servers we use is an IBM BladeCenter HS21 with 7 blades. Each blade is an Intel Xeon 5000 series server with a 3 GHz quad-core CPU and 4 GB of memory. Each blade is equipped with 3 different speed scaling technologies, which we'll discuss in detail in the next few slides. Finally, we experiment with a range of workloads: CPU bound workloads such as Intel's LINPACK and DAXPY, memory bound workloads such as STREAM, and various others. However, for this talk we'll only show results for CPU bound LINPACK and memory bound STREAM.

8 Outline (recap)
Next, I'll talk about how power affects server speed within a single server.

9 Our Experimental Results
How power affects server speed for a single server
DFS: Dynamic Frequency Scaling — "linear". [Graph: frequency (GHz, server speed) vs. power (watts); P = system power, NOT just processor power.]
The first scaling technology we consider is DFS, Dynamic Frequency Scaling. Here, we lower the power consumption of the server by reducing its clock frequency directly. As you can see from the graph, as power decreases, server frequency also decreases. The graph here is for CPU bound LINPACK. We analytically track this curve using a simple linear fit, s = s_min + α(P − P_min), where s is the server speed (frequency) and P is the power coming into the server. The constants are as follows: P_min is the lowest power in the graph, 180 watts; s_min is the lowest speed in the graph, around 1.2 GHz; and α is the slope of the linear fit. Though it is widely believed in the literature that power to frequency has a cubic relationship, we find that a linear fit works well for DFS. This is because we are considering system power, not just processor power, which is what theoretically has a cubic dependence on frequency.
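The linear fit described above can be written down directly. In this sketch, P_min = 180 W and s_min = 1.2 GHz come from the talk, while the slope ALPHA is a hypothetical illustrative value, not the measured one:

```python
# Linear power-to-frequency fit for DFS, as described above:
#   s(P) = s_min + alpha * (P - P_min)
# P_min and s_min are the values quoted in the talk; ALPHA is a
# hypothetical slope chosen only for illustration.

P_MIN = 180.0   # watts: lowest system power in the measured range
S_MIN = 1.2     # GHz: server speed at P_MIN
ALPHA = 0.008   # GHz per watt: slope of the linear fit (illustrative)

def speed_dfs(power_watts: float) -> float:
    """Server speed in GHz under DFS at the given system power."""
    return S_MIN + ALPHA * (power_watts - P_MIN)

print(speed_dfs(180.0))  # 1.2 GHz at the minimum power
print(speed_dfs(240.0))  # 1.68 GHz at the maximum power
```

Note that because this models *system* power, the fit only applies inside the measured range [P_min, P_max]; below P_min the server is simply off.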

10 Our Experimental Results
How power affects server speed for a single server
[Graphs: frequency (GHz) vs. power (watts) for DFS, DVFS and DVFS+DFS, under CPU bound "LINPACK" (top row) and memory bound "STREAM" (bottom row).]
The next speed scaling technology we consider is DVFS, shown here in blue. In DVFS, we lower the server frequency and the voltage together, which leads to greater power savings than DFS. Finally, we consider a mixture of DVFS and DFS, shown here in black. Note that DVFS+DFS looks more like a cubic than DFS and DVFS do. So far, all these curves deal only with the CPU bound LINPACK workload; recall that we are looking at system power, not just processor power. We'll now see how these graphs change when we use a memory bound workload, STREAM. If you notice, the curves for STREAM are mostly cubic, even for DFS and DVFS. This is for the following reason: at extremely low server frequencies, the bottleneck for STREAM is the CPU, so every watt of power added to the system at such low frequencies goes into improving the CPU clock speed. After a point, the bottleneck for STREAM becomes the memory subsystem, so every watt added at high frequencies is used up by the memory subsystem, and the improvement in CPU frequency is minimal.
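A cubic power-to-frequency relationship of the kind described for STREAM (and for DVFS+DFS under LINPACK) can be sketched by inverting "power grows like the cube of frequency". The functional form and all constants below are illustrative assumptions, not the paper's measured fit:

```python
# Sketch of a cubic power-to-frequency curve: if power grows like s^3,
# then added power buys frequency only as its cube root:
#   s(P) = s_min + alpha * (P - P_min) ** (1/3)
# All constants here are hypothetical, chosen only to show the shape.

P_MIN = 150.0   # watts (illustrative)
S_MIN = 1.0     # GHz at P_MIN (illustrative)
ALPHA = 0.39    # GHz per watt^(1/3) (illustrative)

def speed_cubic(power_watts: float) -> float:
    """Server speed in GHz under a cubic power-to-speed relationship."""
    return S_MIN + ALPHA * (power_watts - P_MIN) ** (1.0 / 3.0)

# The curve rises steeply at first and then flattens: the first watts
# buy a lot of frequency, later watts buy very little.
gains = [speed_cubic(p + 10) - speed_cubic(p) for p in (160, 200, 240)]
assert gains[0] > gains[1] > gains[2]
```

This diminishing-returns shape is exactly why, later in the talk, running more servers at an intermediate power level (PowMed) can beat fewer servers at full power.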

11 Outline (recap)
Now that we have some understanding of how power relates to server speed within a single server, let's look at how the individual server speeds affect the response times of jobs in the server farm. This is non-trivial; to see how non-trivial, we have a pop quiz.

12 Pop Quiz (high arrival rate)
1. Given P = 720 W and DVFS, which allocation is better: 180|180|180|180 (PowMin, 4 × 180 W) or 240|240|240|0 (PowMax, 3 × 240 W)?
2. Given P = 720 W and DFS, which allocation is better: 180|180|180|180 (PowMin) or 240|240|240|0 (PowMax)?
Before we go on, let's try to answer some questions about power allocation. First, assume you have a total power of 720 watts and your servers all use DVFS, the blue line in the graph. Which power allocation would be better for minimizing mean response time? Your first choice is 4 slow servers at 180 watts each, which we'll call PowMin. Or would you rather have 3 fast servers at 240 watts each, which we'll call PowMax? Now the same question, except this time your servers all use DFS, the red line in the graph. For DFS, PowMax is ten times better than PowMin! If you look at the bottom-left graph, for DFS, PowMin corresponds to a very low server frequency of around 1 GHz, so 4 servers would mean a total of 4 GHz of speed. Whereas for DVFS, PowMin corresponds to almost 2.5 GHz, which is quite good, so 4 servers would mean a total of 10 GHz of speed. So you might think this is the reason behind the results. Well, all these results were for high arrival rates. What happens when we look at low arrival rates?

13 Pop Quiz (low arrival rate)
Same questions as before (P = 720 W; PowMin = 4 × 180 W vs. PowMax = 3 × 240 W), but now at a low arrival rate.
How about the case where we use DVFS, the blue line? Surprisingly, PowMin is now worse than PowMax, so we see a reversal of the results for DVFS. How about DFS? The results are the same as before! If you found the results of the pop quiz surprising, we'll now explain them and show how a simple theoretical model allows us to predict all these results.

14 Abstract Model of Server Farm
Each server: Processor Sharing. Poisson arrivals with rate λ jobs/sec.
To understand the effects of server speeds and arrival rate on the mean response time of the system, we need to build an abstract model of our farm. Here's our familiar server farm, with power P coming into the system. The load balancer splits this power into individual server powers. Using queueing theory, we model each server as a queueing system. Each server does processor sharing among its jobs: if we have n jobs at a server, each receives one nth of the server speed, similar to running jobs on a UNIX time-sharing machine. The power at each server corresponds to some server speed; we denote these speeds by s1, s2 and s3. We model the arrival stream of the workload as a Poisson process with rate λ jobs per second. Our load balancer also outputs the fractions of load, q1, q2 and q3, going to each server.

15 Response Time for Server Farm
Mean response time: E[T] = q1/(s1 − λq1) + q2/(s2 − λq2) + q3/(s3 − λq3)
Using queueing theory, we can show that the mean response time of this system is as above: q1 divided by (s1 minus λ times q1), plus q2 divided by (s2 minus λ times q2), and so on. Observe that the mean response time is non-linear in the server speeds s_i and the load fractions q_i. Thus, we can't simply compare the sums of server speeds under PowMin and PowMax from the quiz to find the optimal power allocation. How about the arrival rate λ? If λ is very low, we'll have very few jobs in the server farm at any moment, so PowMin, which leads to many slow servers, results in poor utilization of some servers. This is why PowMax was preferred in the quiz at low arrival rates. When λ is high, all servers are well utilized, so finding the optimal power split is not straightforward: the choice of PowMin vs. PowMax depends on other factors, such as the scaling technology used by the servers.
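The formula above can be evaluated directly. Below is a small sketch of my own, with made-up speeds in jobs-per-second units rather than GHz, that shows the non-linearity: which split wins flips with the arrival rate λ, just as in the pop quiz.

```python
def mean_response_time(speeds, fracs, lam):
    """E[T] = sum_i q_i / (s_i - lam * q_i), the farm formula above.
    speeds: server speeds s_i (jobs/sec); fracs: load fractions q_i
    summing to 1; lam: Poisson arrival rate (jobs/sec). Valid only
    while every server is stable, i.e. s_i > lam * q_i."""
    total = 0.0
    for s, q in zip(speeds, fracs):
        if q == 0.0:
            continue  # a server receiving no load contributes nothing
        if s <= lam * q:
            raise ValueError("server overloaded; model undefined")
        total += q / (s - lam * q)
    return total

# Hypothetical comparison: 4 slow servers vs. 3 fast ones, load split evenly.
for lam in (2.0, 3.8):
    slow = mean_response_time([1.0] * 4, [0.25] * 4, lam)   # "PowMin"-like
    fast = mean_response_time([1.3] * 3, [1 / 3] * 3, lam)  # "PowMax"-like
    print(lam, slow, fast)
# With these made-up speeds, the fast trio wins at lam = 2.0 but the
# slow quartet wins at lam = 3.8 -- the reversal discussed above.
```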

16 Outline (recap)
Now that we have some intuition for how power relates to speed, and how speed and arrival rate relate to mean response time, let us look at our results for optimal power allocation.

17 Power Allocation Choices
PowMin: run P/Pmin servers at Pmin watts each (e.g. P = 720 W ⇒ 4 × 180 W)
PowMax: run P/Pmax servers at Pmax watts each (e.g. P = 720 W ⇒ 3 × 240 W)
PowMed: run P/Pknee servers at Pknee watts each (e.g. P = 720 W ⇒ 3 × 210 W)
We first formally define certain power allocation choices. The graph here shows DFS, DVFS and DVFS+DFS for CPU bound LINPACK. We define three parameters for these graphs: Pmin, the lowest power consumption, which is 180 watts; Pmax, the highest power consumption, which is 240 watts; and Pknee, the power consumption at the knee of DVFS+DFS, which in our case is 210 watts. We define PowMin as the power allocation where we run P/Pmin servers at power Pmin each, with all other servers off. For example, with a total power of 720 watts, PowMin means running 4 servers at 180 watts each. Likewise, we define PowMax as running P/Pmax servers at Pmax watts each, which means 3 servers at 240 watts each in our example. Finally, we define PowMed as running P/Pknee servers at Pknee watts each. Note that in certain cases, such as the last example, we don't have enough power left over to turn on additional servers. Throughout the rest of the talk, we choose the total power P to be a near-exact multiple of Pmin, Pmax and Pknee, so there is no wastage.
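The three allocation choices can be computed mechanically. A minimal sketch, using the power levels quoted in the talk (Pmin = 180 W, Pmax = 240 W, Pknee = 210 W):

```python
def allocate(total_power: float, per_server: float) -> list:
    """Run floor(P / per_server) servers at per_server watts each;
    the rest stay off. Any leftover power is simply unused."""
    n = int(total_power // per_server)
    return [per_server] * n

P = 720.0
pow_min = allocate(P, 180.0)  # PowMin: 4 servers at 180 W
pow_max = allocate(P, 240.0)  # PowMax: 3 servers at 240 W
pow_med = allocate(P, 210.0)  # PowMed: 3 servers at 210 W, 90 W unused
```

The PowMed case illustrates the "wastage" caveat above: 720 W is not an exact multiple of 210 W, so 90 W is left over but cannot turn on a fourth server.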

18 Power Allocation Theorems
Inputs (system parameters): speed scaling technology (linear steep, linear flat, or cubic); workload type; Pmin and Pmax; arrival rate (two regimes, λ < λ0 and λ ≥ λ0); open vs. closed workload configuration.
Output: the optimal power allocation — one of PowMin, PowMax, PowMed.
The optimal power allocation, as we have seen, depends on various system parameters: the speed scaling technology, which can be a variant of linear, or cubic; the workload type, since it affects the power-to-speed relationship; and Pmin and Pmax, which are also important since they define our allocation choices and can themselves depend on the speed scaling technology. Clearly, the arrival rate matters as well: our theorems show that what matters is whether you are above or below a certain threshold arrival rate, which we derive. Another important factor is whether we have an open or a closed workload configuration. For this talk, we'll only look at open configurations; theorems and experimental results for closed configurations can be found in our paper. We have come up with theorems that take all these factors into account and output the optimal power allocation. We have found that the optimal power allocation is always one of PowMin, PowMax or PowMed, and no other allocation.

19 Power Allocation Results: Outline
CPU bound "LINPACK": DFS, DVFS, DVFS+DFS. Then memory bound "STREAM".
The results will be presented as follows: we'll first look at the CPU bound workload LINPACK, under DFS, DVFS and DVFS+DFS in turn. Then, we'll look at the results for memory bound STREAM.

20 Power Allocation Results
CPU bound "LINPACK" — DFS
Our first result deals with DFS, the red curve here. Our first theorem tells us that if the speed scaling is linear and steep, then PowMax is always optimal (α here is the slope of the speed scaling curve). Since DFS is steep, we predict PowMax to be optimal for mean response time. Let's see what our experiments tell us. The y-axis in this graph is mean response time, so lower is better. The total power P in this experiment was 720 watts, so PowMin means 4 servers at 180 watts and PowMax means 3 servers at 240 watts. As you can see, PowMax, the grey curve, is below PowMin, the green curve, for all arrival rates. Thus, we have correctly predicted the optimal power allocation in the case of DFS. Also notice that the difference in mean response time is as much as a factor of 5 at high arrival rates.

21 Power Allocation Results
CPU bound "LINPACK" — DVFS
Next, we consider DVFS, the blue curve. Our second theorem tells us that if the speed scaling is linear and flat, as for DVFS, then the optimal allocation depends on the arrival rate: if the arrival rate is low, PowMax is optimal; if the arrival rate is high, PowMin is optimal. Experimentally, we find that this is exactly the case. Note how the green line, PowMin, produces lower response times than the grey line, PowMax, at high arrival rates. Again, our theorems have correctly predicted the optimal power allocation.

22 Power Allocation Results
CPU bound "LINPACK" — DVFS+DFS
Finally, we look at DVFS+DFS, the downwards-concave curve here. Our third theorem tells us that if the speed scaling is cubic, like DVFS+DFS, then the optimal allocation again depends on the arrival rate: if the arrival rate is low, PowMax is optimal; if the arrival rate is high, PowMed is optimal. That is exactly what we see in our experiments. Note how the brown line, PowMed, achieves lower mean response times than the grey line, PowMax, at high arrival rates. For completeness, we also show PowMin, in green; note that PowMin is not good at all in this case.

23 Power Allocation Results
Memory bound "STREAM" — DFS, DVFS, DVFS+DFS
[Graphs: mean response time (sec) vs. arrival rate (jobs/sec) for each of DFS, DVFS and DVFS+DFS.]
For STREAM, recall that all the speed scaling mechanisms had a cubic power-to-frequency relationship. For a cubic, we expect PowMax to be optimal at low arrival rates and PowMed to be optimal at high arrival rates. For DFS and DVFS, this is exactly what we observe in our experiments, although the difference is very small. However, for DVFS+DFS, we find that PowMax, the grey line, is optimal throughout the range of arrival rates. This is because the threshold arrival rate above which PowMed becomes optimal is greater than the arrival rates we have here.

24 Conclusions: How to allocate power optimally
Speed scaling linear and steep: PowMax at low arrival rates; PowMax at high arrival rates.
Speed scaling linear and flat: PowMax at low arrival rates; PowMin at high arrival rates.
Speed scaling cubic: PowMax at low arrival rates; PowMed at high arrival rates.
I would like to conclude this talk with a pictorial description of our optimal power allocation algorithm. This is quite easy to follow.
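The decision rule from the conclusion slide can be transcribed directly as code. A sketch, where `threshold` stands in for the workload-dependent cutoff rate λ0 derived in the paper (its actual value is not given here):

```python
def optimal_allocation(scaling: str, arrival_rate: float,
                       threshold: float) -> str:
    """Decision rule from the conclusion slide.
    scaling: 'linear-steep' (e.g. DFS), 'linear-flat' (e.g. DVFS),
    or 'cubic' (e.g. DVFS+DFS). threshold is the cutoff arrival
    rate lambda_0 from the theorems (not derived here)."""
    low = arrival_rate < threshold
    if scaling == "linear-steep":
        return "PowMax"                       # PowMax at any arrival rate
    if scaling == "linear-flat":
        return "PowMax" if low else "PowMin"
    if scaling == "cubic":
        return "PowMax" if low else "PowMed"
    raise ValueError("unknown scaling technology: " + scaling)
```

For example, `optimal_allocation("linear-flat", lam, lam0)` returns "PowMin" exactly when λ ≥ λ0, matching the DVFS pop-quiz reversal seen earlier in the talk.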

