GreenHadoop: Leveraging Green Energy in Data-Processing Frameworks Íñigo Goiri, Kien Le, Thu D. Nguyen, Jordi Guitart, Jordi Torres, and Ricardo Bianchini.

Presentation transcript:


Motivation
Datacenters consume large amounts of energy
Energy cost is not the only problem
– Brown sources: coal, natural gas…
Connect datacenters to green sources
– Solar panels, wind turbines…
– Green datacenter
– Early examples in the field

Green datacenter
Energy sources
– Solar/wind: variable over time
– Electrical grid: backup
Mitigation approaches are not ideal
– Batteries and net metering
We need to match the energy demand to the supply
[Figure: power over time, comparing load and workload with solar power]

Delaying load within time bounds
Delaying some jobs is OK (respecting time bounds)
[Figure: jobs J1, J2, and J3 drawn as nodes and power over time, before and after delaying]

Scheduling data-processing workloads in green datacenters
Data-processing jobs
– Each task operates on a chunk of data
– Data distributed among servers
Simple workflow: MapReduce
– Map tasks: process input data
– Reduce tasks: merge maps’ outputs
Challenges
Match MapReduce workload with green energy availability
– No information on #nodes, length, power…
Conserve energy while ensuring data availability
[Figure: Map, Shuffle, Reduce]
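The map, shuffle, and reduce steps named on this slide can be illustrated with a minimal in-process sketch. The word-count job, the function names, and the use of plain Python lists instead of a Hadoop cluster are all assumptions made for illustration; nothing here is the Hadoop API.

```python
from collections import defaultdict


def map_task(chunk):
    """Map: process one chunk of input and emit (key, value) pairs."""
    return [(word, 1) for word in chunk.split()]


def shuffle(map_outputs):
    """Shuffle: group every emitted value by its key."""
    groups = defaultdict(list)
    for output in map_outputs:
        for key, value in output:
            groups[key].append(value)
    return groups


def reduce_task(key, values):
    """Reduce: merge the values that the map tasks produced for one key."""
    return key, sum(values)


def run_job(chunks):
    """Run one map task per chunk, then shuffle, then one reduce per key."""
    map_outputs = [map_task(chunk) for chunk in chunks]
    groups = shuffle(map_outputs)
    return dict(reduce_task(key, values) for key, values in groups.items())
```

Each chunk stands for a block of data stored on some server, which is why the scheduler has to care about where data lives before it turns servers off.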

Overview of GreenHadoop
Predict solar energy availability
May delay jobs but must meet time bounds
– Maximize green energy use
– If not enough green energy, minimize brown electricity cost
– Brown energy cost + peak brown power cost
Deactivate idle servers while keeping data available
Divided into two parts
1. Computation scheduling
2. Data management

1. Computation scheduling
Estimate the energy required by jobs (EWMA)
Predict energy availability (weather forecast)
Assign green energy first
Assign cheap brown energy
Assign expensive energy
Current power → Active servers
As time goes by… the number of active servers changes
[Figure: Job1 to Job6 placed on a power-over-time plot starting at Now, with on-peak and off-peak periods, the previous peak, and the number of active servers]
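The slides say that job energy is estimated with an EWMA (exponentially weighted moving average) but give no formula. The sketch below shows the standard EWMA update. The quantity being averaged (energy per job, in Wh), the smoothing factor `alpha`, and the function names are assumptions, not details taken from the slides.

```python
def ewma_update(previous_estimate, observed, alpha=0.5):
    """Blend the newest observation into the running estimate.

    alpha close to 1 reacts quickly to new observations;
    alpha close to 0 changes the estimate slowly.
    """
    if previous_estimate is None:
        # No history yet: the first observation is the estimate.
        return observed
    return alpha * observed + (1 - alpha) * previous_estimate


def estimate_energy(observations, alpha=0.5):
    """Fold a sequence of observed energy values (Wh) into one estimate."""
    estimate = None
    for observed in observations:
        estimate = ewma_update(estimate, observed, alpha)
    return estimate
```

With `alpha=0.5`, the observations 10, 20, and 40 Wh give estimates of 10, 15, and 27.5 Wh in turn, so recent jobs weigh more than old ones.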

2. Data management
Deactivate servers to save energy
– Some data might become unavailable
Prior solution: covering subset [Leverich’09]
– Set of servers always running has ALL data
Our approach
– Only required data has to be available
– We usually require fewer active servers

2. Data management
GreenHadoop (computation) requires only 2 servers
Move required files to Active servers (replicate)
Decommissioned server can be sent to Down
Jobs to be executed change → Required files change
Make missing data available
GreenHadoop (computation) requires 3 servers
[Figure: servers in the Active, Decommission, and Down states holding required and non-required files; the running queue holds JobA, JobB, and JobC, and later JobB, JobC, and JobD]

Evaluation methodology
Cluster with 16 Xeon servers
– Hadoop and Hadoop turning off idle servers (EAHadoop)
– GreenHadoop: green energy, brown electricity cost
Energy profile
– NJ electricity pricing (on/off peak and peak cost)
– Solar farm energy availability (14 PV panels)
– Five pairs of days (combinations of high and low days)
Workload
– Derived from Facebook [Zaharia’09]
– Jobs with up to 37GB, 600 tasks, and 6 hours of length
– Internal time bound of one day

Energy prediction vs actual
[Figure: predicted versus actual energy, annotated with rain, thunderstorm, and cloud cover]

GreenHadoop for Facebook & high-high days
31% more green
39% cost savings
[Figure: green consumed, brown consumed, brown price, green predicted, and green produced over time; annotated values include 30 kWh, 59 kWh, and 25 kWh]

GreenHadoop for Facebook
[Figures: different pairs of days; effect of parameters in GreenHadoop]

Other results
Workload intensity (datacenter utilization)
High-priority jobs
Shorter time bounds
Data availability
Workload variations
Consistent green energy increases and cost savings

Conclusions
Data-processing scheduler for green datacenters
Predicts green energy availability
Increases the use of green energy
Reduces brown electricity costs
Manages data availability
We are building Parasol
– Solar-powered μdatacenter
– Poster session


Dealing with electricity costs
Schedule jobs: evaluate electricity cost
– Green energy is “free” (amortization): $0.00/kWh
– Cheap energy (11pm to 9am): $0.08/kWh
– Expensive energy (9am to 11pm): $0.13/kWh
– Off-peak power cost: $5.59/kW per month
– On-peak power cost: $13.61/kW per month
Optimization goal
– Minimize electricity-related costs while meeting deadlines
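To make the two cost components concrete, the sketch below computes a monthly brown electricity bill from the prices listed on this slide. The prices and the on-peak window come from the slide. The hourly sampling, the rule that each peak charge applies to the highest brown power drawn in that period during the month, and all names are assumptions; a real utility tariff may meter peak power differently.

```python
# Prices from the slide. Green energy costs $0.00/kWh, so only brown draw is billed.
OFF_PEAK_ENERGY = 0.08   # $/kWh, 11pm to 9am
ON_PEAK_ENERGY = 0.13    # $/kWh, 9am to 11pm
OFF_PEAK_POWER = 5.59    # $/kW per month
ON_PEAK_POWER = 13.61    # $/kW per month


def is_on_peak(hour_of_day):
    """On-peak runs from 9am (hour 9) up to 11pm (hour 23)."""
    return 9 <= hour_of_day < 23


def monthly_brown_cost(samples):
    """Return the monthly bill in dollars.

    samples: iterable of (hour_of_day, brown_kw) pairs, one per hour of the
    month, where brown_kw is the average brown power drawn in that hour.
    Assumption: the peak charge uses the highest hourly average per period.
    """
    energy_cost = 0.0
    on_peak_kw = 0.0
    off_peak_kw = 0.0
    for hour_of_day, brown_kw in samples:
        if is_on_peak(hour_of_day):
            energy_cost += brown_kw * ON_PEAK_ENERGY   # kW for one hour = kWh
            on_peak_kw = max(on_peak_kw, brown_kw)
        else:
            energy_cost += brown_kw * OFF_PEAK_ENERGY
            off_peak_kw = max(off_peak_kw, brown_kw)
    peak_cost = on_peak_kw * ON_PEAK_POWER + off_peak_kw * OFF_PEAK_POWER
    return energy_cost + peak_cost
```

Under these assumptions a single hour at high brown power raises the peak charge for the whole month, which is why the scheduler tries to stay below the previous peak.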

Our proposal: GreenHadoop
Predict green energy availability
– Weather forecast
Schedule jobs
– Maximize green energy use ($0/Wh)
– If green not available, consume cheap brown ($/Wh on/off-peak)
– When using brown, reduce peak power cost ($/W)
Turn off idle servers to save energy
Optimization goal
– Minimize electricity-related costs
– May delay jobs but must meet deadlines
– Guarantee data availability

Evaluation methodology
Workloads
– FaceD: GridMix derived from Facebook [Zaharia’09]
– NutchI: crawling and indexing for Rutgers webpages
Length
– Tasks from 2 to 60 seconds
– Jobs from 4 to 600 tasks
– Some jobs take up to 6 hours using our whole cluster
Data
– Files distributed in blocks of 64MB
– Minimum of 2 replicas per block
– Jobs use from 64MB to 37.50GB
Default deadline of one day

Green datacenter
Energy sources
– Solar/wind: variable availability over time
– Electrical grid: backup
Other (problematic) approaches
– Batteries: losses, cost, environmental
– Bank energy on the grid: losses, cost, unavailability
[Figure: wind power and solar power over time]

1. Computation scheduling
1. Estimate energy required by jobs
2. Predict energy availability (weather forecast)
3. Schedule energy to minimize electricity costs
   1. Assign green energy ($0/Wh)
   2. Assign brown energy
      – Cheap energy cost ($/Wh)
      – Expensive energy cost ($/Wh)
      – Peak-power cost ($/W)
4. Calculate current number of Active servers
5. Perform “2. Data management”
6. Submit jobs for execution
7. Send non-required servers to S3 to save energy
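Step 3 above (green first, then cheap brown, then expensive brown) can be sketched as a greedy pass over time slots. This is a simplified illustration, not the authors' algorithm: it ignores deadlines and the monthly peak-power charge, and it uses a fixed per-slot brown cap (`brown_cap_wh`) as a stand-in for limiting peak power. All names and the slot model are assumptions.

```python
def assign_energy(required_wh, green_wh, is_cheap, brown_cap_wh):
    """Greedily cover required_wh across time slots.

    required_wh: energy the queued jobs need within the horizon
    green_wh[i]: predicted green energy available in slot i
    is_cheap[i]: True if slot i falls in the off-peak (cheap) period
    brown_cap_wh: most brown energy we allow ourselves per slot
    Returns (green, brown, unmet): per-slot assignments and leftover demand.
    """
    slots = len(green_wh)
    green = [0.0] * slots
    brown = [0.0] * slots
    remaining = required_wh

    # 1. Assign green energy first: it costs $0/Wh.
    for i in range(slots):
        take = min(green_wh[i], remaining)
        green[i] = take
        remaining -= take

    # 2. Assign cheap brown energy, then 3. expensive brown energy.
    for want_cheap in (True, False):
        for i in range(slots):
            if is_cheap[i] != want_cheap:
                continue
            take = min(brown_cap_wh, remaining)
            brown[i] = take
            remaining -= take

    return green, brown, remaining
```

The per-slot totals `green[i] + brown[i]` are what step 4 would turn into a number of Active servers, given an assumed power draw per server.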

2. Data management
We want to deactivate servers to save energy
– Data is distributed among servers
– Some data might not be available
Common solution: covering subset [Leverich’09]
– ALL data must always be available
– Minimum set of servers always running
Our approach
– Jobs running change → Required data changes
– Only required data has to be available
– Move required data to Active servers
– Decommission servers: provide data
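The idea that only required data has to be available can be sketched as a small planning step: find the required files that have no replica on an Active server, then pick a source and an Active target for each copy. The replica map, the least-loaded target choice, and every name below are assumptions for illustration; the slides do not describe how replicas are placed.

```python
def missing_required_files(replicas, required_files, active_servers):
    """Required files that have no replica on any Active server."""
    return sorted(
        f for f in required_files
        if not (replicas.get(f, set()) & active_servers)
    )


def plan_replication(replicas, required_files, active_servers):
    """Map each missing required file to a (source, target) server pair.

    replicas: dict mapping file name to the set of servers holding a replica
    required_files: files needed by the jobs currently queued or running
    active_servers: servers the computation scheduler decided to keep Active
    """
    if not active_servers:
        raise ValueError("at least one Active server is needed")
    load = {server: 0 for server in active_servers}
    plan = {}
    for f in missing_required_files(replicas, required_files, active_servers):
        sources = replicas.get(f, set())
        if not sources:
            continue  # no known replica anywhere, so nothing to copy from
        source = min(sources)
        # Assumption: spread copies evenly, breaking ties by server name.
        target = min(load, key=lambda server: (load[server], server))
        load[target] += 1
        plan[f] = (source, target)
    return plan
```

Once every planned copy has finished, the servers that are not Active hold no required data that is missing elsewhere and can be sent to Down. Non-required files on them simply stay unavailable until a job needs them.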

Other results
Workload intensity (datacenter utilization)
– Works well with low/medium utilization
– Similar to conventional under high utilization
High-priority jobs
– No performance degradation for high-priority jobs
– A large number of high-priority jobs reduces our benefits
Shorter time bounds
– 19% violations under really tight time bounds
Data availability
– Savings equal to or higher than the covering subset
Workload variations
– Nutch web-crawling and indexing
– Consistent green energy increases and cost savings

Motivation
Datacenters consume large amounts of energy
Energy cost is not the only problem
– Brown sources: coal, natural gas…
Lots of small and medium datacenters
– Consume the majority of electricity in DCs
Connect datacenters to green sources
– Solar panels, wind turbines…
– Green datacenter

Delaying load within time bounds
Delaying some jobs is OK (respecting time bounds)
[Figure: jobs J1, J2, and J3 drawn as nodes and power over time from Now, before and after delaying]