
1 Addressing shared resource contention in datacenter servers Colloquium Talk by Sergey Blagodurov Stony Brook University Fall 2013

2 My research (a 40,000-foot view) Academic research at Simon Fraser University:  I am finishing my PhD with Prof. Alexandra Fedorova  My work is on scheduling in High Performance Computing (HPC) clusters  I prototype better datacenters! Industrial research at Hewlett-Packard Laboratories:  I am a Research Associate in the Sustainable Ecosystems Research Group  My work is on designing a net-zero energy cloud infrastructure

3 Why are datacenters important? #1 Dematerialization  Online shopping – less driving  Working from home  Digital content delivery

4 Why are datacenters important? #2 Moving into the cloud

5 Why are datacenters important? #3 Increasing demand for supercomputers  The biggest scientific discoveries  Tremendous cost savings  Medical innovations

6 Why do research in datacenters? Datacenters use lots of energy:  Consumption rose by 60% in the last five years  More than the entire country of Mexico!  Now ~1-2% of world electricity Typical electricity costs per year:  Google (>500K servers, ~72MW): $38M  Microsoft (>200K servers, ~68MW): $36M  Sequoia (~100K nodes, 8MW): $7M Datacenters consume lots of energy, and it's getting worse! [photo: seawater pumped-storage hydroelectric plant on Okinawa, Japan]
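The cost figures above can be checked with simple arithmetic. A minimal sketch, where the flat electricity rate is an assumption (the slide's own numbers imply rates roughly between $0.06 and $0.10 per kWh):

```python
# Back-of-the-envelope annual electricity cost for a datacenter
# drawing a constant load. The rate is an assumed flat $/kWh price;
# real contracts vary by region, contract and utilization.
def annual_cost_usd(power_mw, rate_per_kwh=0.06):
    hours_per_year = 24 * 365          # ignoring leap years
    kwh = power_mw * 1000 * hours_per_year
    return kwh * rate_per_kwh
```

At $0.06/kWh, a constant 72 MW comes to about $37.8M/year, close to the $38M quoted for Google; Sequoia's $7M at 8 MW implies a higher rate, around $0.10/kWh.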

7 Why do research in datacenters? A 20 MW datacenter running 24/7 for one year is equivalent to:  23k cars in annual greenhouse gas emissions  CO2 emissions from the electricity use of 15k homes for one year A single datacenter generates as much greenhouse gas as a small city!

8 Where do datacenters spend energy? Servers: 70-90% Cooling and other infrastructure: 10-30% CPU and memory are the biggest consumers

9 An AMD Opteron 8356 (Barcelona) domain [diagram: four cores (0-3), each with private L1 and L2 caches, connected through a System Request Interface and crossbar switch to a shared L3 cache, a memory controller serving memory node 0, and HyperTransport links to the other domains; together these form NUMA Domain 0]

10 An AMD Opteron system with 4 domains [diagram: four NUMA domains (0-3), each with four cores and private L1/L2 caches, a shared L3 cache, a memory controller (MC) attached to a local memory node, and HyperTransport (HT) links between domains; cores are interleaved across domains, e.g. cores 0, 4, 8, 12 share Domain 0]

11 Contention for the shared last-level cache (CA) [diagram: the four-domain Opteron system, highlighting threads on cores of the same domain competing for that domain's shared L3 cache]

12 Contention for the memory controller (MC) [diagram: the four-domain Opteron system, highlighting threads in one domain competing for the local memory controller]

13 Contention for the inter-domain interconnect (IC) [diagram: the four-domain Opteron system, highlighting memory traffic competing for the HyperTransport links between domains]

14 Remote access latency (RL) [diagram: thread A runs in one domain while its memory resides in another, so its accesses cross the interconnect]

15 Isolating memory controller contention (MC) [diagram: threads A and B placed so that the memory controller is the only resource they share, isolating MC contention from the other factors]

16 Dominant degradation factors Memory controller (MC) and interconnect (IC) contention are the key factors hurting performance

17 Contention-Aware Scheduling Characterization method:  Given two threads, decide if they will hurt each other's performance if co-scheduled Scheduling algorithm:  Separate threads that are expected to interfere [diagram: threads A and B moved apart]

18 Characterization Method Limited observability:  We do not know for sure whether threads compete, or how severely! Trial and error is infeasible on large systems:  We can't try all possible combinations  Even sampling becomes difficult A good trade-off: measure the LLC miss rate!  Threads are predicted to interfere if they both have high miss rates  Caveat: this does not account for the impact of cache contention itself
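The rule above reduces to a simple predicate. A minimal sketch, where the threshold value and the misses-per-kilo-instruction units are illustrative assumptions, not numbers from the talk:

```python
# Predict whether two co-scheduled threads will interfere: both must
# be memory-intensive, i.e. both LLC miss rates exceed a threshold.
# The threshold (misses per 1000 instructions) is an assumed value.
def will_interfere(miss_rate_a, miss_rate_b, threshold=5.0):
    return miss_rate_a > threshold and miss_rate_b > threshold
```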

19 Miss rate as a predictor for contention penalty

20 Server-level scheduling Goal: isolate threads that compete for shared resources, and pull their memory to the local node upon migration  Sort threads by LLC miss rate  Migrate competing threads, along with their memory, to different domains [diagram: threads A, B, X, Y sorted by miss rate and spread across Domain 1 and Domain 2 together with their memory nodes]
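The policy above can be sketched in a few lines: rank threads by LLC miss rate and spread the most memory-intensive ones across NUMA domains so that heavy threads do not share a memory controller. Thread names and miss-rate values here are illustrative:

```python
# Sort threads by LLC miss rate (descending) and assign them to NUMA
# domains round-robin, so the most intensive threads land in
# different domains. Memory migration is assumed to follow the thread.
def distribute(threads, n_domains):
    """threads: list of (name, llc_miss_rate) -> {domain: [names]}"""
    ranked = sorted(threads, key=lambda t: t[1], reverse=True)
    domains = {d: [] for d in range(n_domains)}
    for i, (name, _) in enumerate(ranked):
        domains[i % n_domains].append(name)
    return domains
```

With A and B having the highest miss rates, `distribute([("A", 9.0), ("B", 8.0), ("X", 2.0), ("Y", 1.0)], 2)` places A and B in different domains, matching the separation shown on the slide.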

21 Server-level results SPEC CPU 2006 SPEC MPI 2007 LAMP

22 Possibilities of datacenter-wide scheduling [diagram: six compute nodes (0-5), each with two NUMA domains, connected by the datacenter network; copies of jobs A, B, C and D are spread across the nodes]

23 Clavis-HPC features Contention-aware cluster scheduling:  See: online detection of contention, communication overhead and power consumption  Think: approximate an optimal cluster schedule (the problem is cast as a multi-objective one)  Do: use low-overhead virtualization (OpenVZ) to migrate jobs across the nodes

24 Enumeration tree search Finding an optimal schedule:  an implementation using the Choco solver  minimizes a weighted sum of the objectives [figure: Branch-and-Bound enumeration search tree]
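The search can be illustrated with a toy branch-and-bound over job-to-node assignments. The real system used the Choco constraint solver and a weighted sum of contention, communication and power objectives; here the cost function is a placeholder that only counts co-located memory-intensive ("devil") jobs, and all names are made up:

```python
# Toy branch-and-bound: enumerate job->node assignments depth-first,
# pruning any partial assignment whose cost already matches or exceeds
# the best complete schedule found so far. Assumes the cost function
# never decreases as the partial assignment grows.
def search(jobs, nodes, cost, best=(float("inf"), None), partial=None):
    partial = {} if partial is None else partial
    if len(partial) == len(jobs):
        c = cost(partial)
        return (c, dict(partial)) if c < best[0] else best
    job = jobs[len(partial)]
    for node in nodes:
        partial[job] = node
        if cost(partial) < best[0]:   # bound: prune hopeless branches
            best = search(jobs, nodes, cost, best, partial)
        del partial[job]
    return best

# Placeholder objective: count nodes holding more than one "devil".
DEVILS = {"A", "B"}
def devil_collisions(assign):
    per_node = {}
    for job, node in assign.items():
        per_node.setdefault(node, []).append(job)
    return sum(1 for js in per_node.values()
               if sum(j in DEVILS for j in js) > 1)
```

`search(["A", "B", "C"], [0, 1], devil_collisions)` returns a zero-cost schedule that keeps the two devils on different nodes.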

25 Solver evaluation (custom branching strategy)

26 Cluster-wide scheduling (a case for HPC) [figure: vanilla HPC framework vs. Clavis-HPC]

27 Results

28 What's the impact? Faster execution saves money:  A datacenter with a $30M electricity bill  20% less energy due to faster execution  $6M/year savings!

29 What’s next? Talk by Sergey Blagodurov Stony Brook University Eric Schmidt, former CEO of Google: Every two days now we create as much data as we did from the dawn of civilization up until Big MoneyBig Responsibility Big Data: -29-

30 Big Data has many facets

31 Use case: sensor data from a cross-country flight

32 Future research directions #1 Memory hierarchy in the Exascale era [diagram: today's compute node, with cores and attached memory nodes, will turn into a node combining cores, memory nodes, PCRAM and FLASH on top of software-defined storage]

33 Future research directions #2 Big Data placement [diagram: move the data to the analysis, or the analysis to the data]

34 Future research directions #3 How to choose a datacenter for a given Big Data analytics task? [diagram: a task routed to a cloud? an HPC cluster? a warehouse? something else?]

35 Conclusion In a nutshell:  Datacenters are the platform of choice  Datacenter servers are major energy consumers  Energy is wasted because of resource contention  I address resource contention automatically and on the fly  Future plans: Big Data retrieval and analysis

36 Any [time for] questions? Addressing shared resource contention in datacenter servers

37 Clavis-HPC framework
1) A user connects to the HPC cluster via a client and submits a job with a PBS script. The user can characterize the job with a contention metric (devil, comm-devil).
2) The Resource Manager (RM) on the head node receives the submission request and passes it to the Job Scheduler (JS).
3) JS determines which jobs execute on which containers and passes the scheduling decision to RM.
4) RM starts/stops the jobs on the given containers.
5) The virtualized jobs execute in the containers under the contention-aware user-level scheduler (Clavis-DINO). They access cluster storage to fetch their input files and store the results.
6) RM generates a contention-aware report about resource usage in the cluster during the last scheduling interval.
7) Users or sysadmins analyze the contention-aware resource usage report.
8) Users can checkpoint their jobs (OpenVZ snapshots).
9) Sysadmins can perform automated job migration across the nodes through OpenVZ live migration, dynamically consolidating the workload on fewer nodes and turning the rest off to save power.
10) RM passes the contention-aware resource usage report to JS.
Components: clients (tablet, laptop, desktop, etc.); a head node running RM, JS and Clavis-HPC; centralized cluster storage (NFS, Lustre); the cluster network (Ethernet, InfiniBand); monitoring (JS GUI) and control (IPMI, iLO3, etc.); compute nodes with contention monitors (Clavis), OpenVZ containers, libraries (OpenMPI, etc.) and RM daemons (pbs_mom)

38 Clavis-HPC additional results Results of the contention-aware experiments

41 Cloud datacenter workloads Critical (preferred access to resources):  RUBiS  WikiBench Non-critical:  Datacenter batch load: Swaptions, Facesim, FDS  HPC jobs: LU, BT, CG

42 Automated collocation Server under-utilization is a long-standing problem:  It increases both CapEx and OpEx  Even for modern servers, energy efficiency at 30% load can be less than half the efficiency at 100% load Solution:  Collocate critical and non-critical applications  Manage resource access through the Linux control group (cgroup) mechanisms Work-conserving vs. non-work-conserving collocation:  managing with weights (priorities) vs. managing with caps (limits)  improving server utilization vs. improving isolation
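The two collocation modes map onto the semantics of Linux cgroup CPU controls: a CFS quota (cpu.cfs_quota_us over cpu.cfs_period_us in cgroup v1) is a non-work-conserving cap, while cpu.shares is a work-conserving weight that only takes effect under contention. A minimal sketch of the resulting CPU fractions, with illustrative group names:

```python
# Work-conserving weights: under full contention each group gets
# weight/total of the CPU; when a group idles, its cycles are
# redistributed to the others instead of going unused.
def share_fractions(weights):
    total = sum(weights.values())
    return {group: w / total for group, w in weights.items()}

# Non-work-conserving cap: the group never exceeds quota/period,
# even when the rest of the CPU is idle -- better isolation,
# lower utilization.
def capped_fraction(quota_us, period_us=100_000):
    return quota_us / period_us
```

`share_fractions({"critical": 2048, "batch": 1024})` gives the critical group two thirds of the CPU under contention, consistent with slide 46's "twice the value = twice the cycles".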

43 What's the impact? Automated collocation enables net-zero energy usage

44 Workload collocation using static prioritization Scenario A (Swaptions, Facesim, FDS) Scenario B (LU, BT, CG)

45 Workload collocation during spikes Weight-based collocation: tolerable performance loss for the critical workload

46 Workload collocation during spikes A shares value twice as high for one process compared to another = twice as many CPU cycles

47 Future research directions #4 Which storage organization is most suitable for each datacenter type? [diagram: cloud, HPC cluster and warehouse each matched against key/value stores? parallel databases? filesystems?]

48 Data warehouse project Data assurance for power delivery networks [diagram: records from meters A, B and C flow into the data warehouse, where assurance rules conclude that meter C is broken]

49 Increasing prediction accuracy The LLC miss rate works, but it is not very accurate What if we want a more accurate metric?  Then we need to profile many performance counters simultaneously  ...and we need to build a model that predicts the degradation  We would have to train the model beforehand on a representative workload The need to train the model is the price of higher accuracy!

50 Our Solution Devising an accurate metric (outline)

52 Devising an accurate metric (methodology)

56 Our Solution Devising an accurate metric (model) The REPTree module in Weka:  creates a tree with each attribute placed in a tree node  the branches of the tree are the values that the attribute takes  each leaf stores the degradation (obtained during the training stage)
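A regression tree of depth one (a stump) already shows the shape of the model: pick the split on a counter value that minimizes squared error, and store the mean degradation in each leaf. This is only a stand-in for Weka's REPTree, which does the same over many attributes with pruning; the sample data below is synthetic:

```python
# Fit a one-split regression tree ("stump") on (counter, degradation)
# samples: try each observed counter value as a threshold and keep the
# split with the lowest total squared error. Leaves store the mean
# degradation of their training samples, as in the REPTree model.
def fit_stump(samples):
    best = None
    for t in sorted({x for x, _ in samples}):
        left = [y for x, y in samples if x <= t]
        right = [y for x, y in samples if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((y - lm) ** 2 for y in left)
               + sum((y - rm) ** 2 for y in right))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    return best[1:]  # (threshold, left_mean, right_mean)

def predict(stump, counter_value):
    t, lm, rm = stump
    return lm if counter_value <= t else rm
```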

57 Devising an accurate metric (results) Intel events:  340 recordable core events, 19 core events selected  Average prediction error: 16% AMD events:  208 recordable core events, 223 recordable chip events  32 core events and 8 chip events selected  Average prediction error: 13%

