Energy Efficiency for MapReduce Workloads: An In-depth Study. Boliang Feng, Renmin University of China, Dec 19.

Energy Efficiency for MapReduce Workloads: An In-depth Study
Boliang Feng, Renmin University of China
fengliangcc@gmail.com
Dec 19 2012 @ ADC 2012

Outline
- Why Energy?
- Factors Affecting Energy Efficiency of MapReduce
- Experimental Design
- Analysis of Results
- Key Findings and Recommendations
- Conclusion and Future Work

Why Energy?
- Cooling
- Cost
- Environmental Effect
- Performance

Data Center: Some Numbers
- The data center in The Dalles, Oregon: ~50 MW
  - Average electricity consumption in the USA: ~900 kWh/month/family, or 1.25 kW
- Power consumption is the major cost and constraint of a data center
- About 7,000 data centers in the USA
- In the US, data centers accounted for roughly 61 billion kWh (1.5% of total U.S. electricity consumption) in 2006 (EPA 2007)
  - The number was expected to double by 2011
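The household figure above is a unit conversion from monthly energy to average power. A minimal sketch of that arithmetic follows, assuming a 30-day month; the function names are illustrative and not from the talk.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month


def average_power_kw(kwh_per_month: float) -> float:
    """Convert monthly energy use (kWh) into average power draw (kW)."""
    return kwh_per_month / HOURS_PER_MONTH


def household_equivalents(datacenter_mw: float, kwh_per_month: float) -> float:
    """Number of average households drawing the same power as the data center."""
    return datacenter_mw * 1000 / average_power_kw(kwh_per_month)


print(f"{average_power_kw(900):.2f} kW per household")        # 1.25 kW per household
print(f"{household_equivalents(50, 900):,.0f} households")    # 40,000 households
```

Under these assumptions, a ~50 MW facility draws about as much power as 40,000 average households.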

Green Computing in Cloud
- Physical construction & chip level
- System software
  - Virtual Datacenter OS
  - IBM: Power-Aware Request Distribution
- Cluster-level view
  - Dynamic resource configuration
  - Workload distribution
  - …
- Green applications
  - DBMS, MapReduce
- Industrial standards
  - Green Grid, PUE (Power Usage Effectiveness), DCiE

Why MapReduce?
- MapReduce & Hadoop
  - MapReduce is a popular distributed processing model for parallel computing in data centers
  - Hadoop is an open-source implementation of MapReduce
- New challenges
  - Energy efficiency has received little attention in the design of MapReduce platforms
  - MapReduce platforms perform automatic parallelization and distribution of computations
  - MapReduce incorporates mechanisms to be resilient to failures

Aim to Answer Two Questions
- We aim to address the following two questions:
  - Which factors affect the cluster-wide energy efficiency of a MapReduce platform?
  - Is there any opportunity for a trade-off between performance and energy?

Question 1: Factors That Affect Energy Efficiency
- We identify four factors that affect the energy efficiency of MapReduce:
  - CPU intensiveness
  - I/O intensiveness
  - Replica factor of the underlying distributed file system
  - File block size of the underlying distributed file system

Question 2: Is There Any Opportunity?
- We identify four typical MapReduce workloads that represent different application scenarios:
  - TextWrite, WordCount, GrepSearch, TeraSort
- We measure the energy consumption at different cluster scales and with other related factors varied

Factors Affecting Energy Efficiency of MapReduce
- Energy consumption model for MapReduce
  - CPU intensiveness, I/O intensiveness
  - Replica factor of the underlying distributed file system, as well as the file block size

Experiment Design
- Metrics
  - Power, time, energy
  - Energy efficiency (EE)
- Cluster setup
  - 2.4 GHz Intel Core Duo processor, 4 GB RAM, 1000 Mbps network card
  - Hadoop-0.20.2
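The three basic metrics are linked by energy = power × time. The sketch below shows one way to turn evenly spaced power-meter samples into energy. The slides do not define EE, so the definition here (megabytes processed per joule) is an assumption for illustration only, as are the function names and sample readings.

```python
from typing import Sequence


def energy_joules(power_watts: Sequence[float], interval_s: float) -> float:
    """Integrate evenly spaced power samples (W) with the trapezoidal rule."""
    if len(power_watts) < 2:
        return 0.0
    inner = sum(power_watts[1:-1])
    return interval_s * (0.5 * (power_watts[0] + power_watts[-1]) + inner)


def energy_efficiency(data_mb: float, energy_j: float) -> float:
    """Assumed EE definition (not from the slides): megabytes processed per joule."""
    return data_mb / energy_j


# Hypothetical readings: a steady 300 W for 60 s, sampled every 10 s (7 samples).
samples = [300.0] * 7
joules = energy_joules(samples, 10.0)
print(joules, "J =", joules / 3600, "Wh")
print(energy_efficiency(1024, joules), "MB/J")
```

Dividing joules by 3600 gives watt-hours, the unit used for the energy figures reported later in the talk.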

Experiment Design
- Workloads:
  - TextWrite: writes a large unsorted random sequence of words from a word list. Network-intensive, map-only job
  - WordCount: map-only CPU-intensive job. Matches regular expressions in the input files. High CPU utilization in the map stage
  - GrepSearch: balance between CPU-intensive (map stage) and I/O-intensive (reduce stage) work. High map/reduce ratio
  - TeraSort: sorts the official input datasets. CPU-bound in the map stage and I/O-bound in the reduce stage. Low map/reduce ratio

Experiment Design
- Varied cluster parameters:
  - Cluster size: 2 to 6 nodes
  - Replica factor: 1 to 5 replicas
  - Block size: 16 MB to 1 GB
  - Data size: 5 to 20 GB
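The parameter ranges above can be enumerated as a sweep. This is a sketch, not the authors' harness: the ranges come from the slide, but the individual steps inside each range (powers of two for block size, 5 GB steps for data size) are assumptions, and the study may have varied one factor at a time rather than the full grid. `dfs.replication` and `dfs.block.size` are the HDFS property names used by Hadoop 0.20.

```python
from itertools import product

MB = 1024 * 1024

# Ranges are from the slide; the steps inside each range are assumptions.
CLUSTER_SIZES = range(2, 7)                          # 2 to 6 nodes
REPLICA_FACTORS = range(1, 6)                        # 1 to 5 replicas
BLOCK_SIZES_MB = [16, 32, 64, 128, 256, 512, 1024]   # 16 MB to 1 GB
DATA_SIZES_GB = [5, 10, 15, 20]                      # 5 to 20 GB


def configurations():
    """Yield one dict per parameter combination, keyed by Hadoop 0.20 property names."""
    for nodes, replicas, block_mb, data_gb in product(
        CLUSTER_SIZES, REPLICA_FACTORS, BLOCK_SIZES_MB, DATA_SIZES_GB
    ):
        yield {
            "nodes": nodes,
            "data_gb": data_gb,
            "dfs.replication": replicas,
            "dfs.block.size": block_mb * MB,  # HDFS expects bytes
        }


runs = list(configurations())
print(len(runs))  # 5 * 5 * 7 * 4 = 700 combinations in the full grid
```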

Analysis of Results
- We ran these four workloads with varied workload size, cluster scale, replica factor, and block size

TextWrite(1)
- Both latency and energy consumption grow almost linearly as the replica factor increases

TextWrite(2)
- With more nodes, performance improves greatly (by 51.3%) and energy consumption decreases from 71 Wh to 49 Wh
- More nodes mean higher power draw, but the reduction in response time can offset the added energy consumption
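The trade-off follows from energy = average power × time: energy falls whenever runtime shrinks faster than power grows. The sketch below works through the TextWrite figures above, assuming the 51.3% improvement is a reduction in response time; the function names are illustrative.

```python
def relative_saving(before: float, after: float) -> float:
    """Fraction saved when a quantity drops from `before` to `after`."""
    return (before - after) / before


def implied_power_ratio(energy_before: float, energy_after: float,
                        time_reduction: float) -> float:
    """From E = P * T: P_after / P_before = (E_after / E_before) / (T_after / T_before)."""
    time_ratio = 1.0 - time_reduction
    return (energy_after / energy_before) / time_ratio


saving = relative_saving(71, 49)
ratio = implied_power_ratio(71, 49, 0.513)
print(f"energy saving: {saving:.1%}")        # energy saving: 31.0%
print(f"average power ratio: {ratio:.2f}")   # average power ratio: 1.42
```

Under that assumption, the larger cluster draws roughly 42% more average power yet uses about 31% less energy, because the job finishes in less than half the time.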

TextWrite(3)
- A small block size enables more tasks to be processed in parallel
- When the block size is larger than 64 MB, the parallelism of the system is reduced, so energy consumption increases significantly

GrepSearch(1)
- A higher replica factor means more choices for task assignment, which improves the load balancing of the system
- The HDFS replica placement policy not only improves data reliability and availability but also improves the parallelism of the GrepSearch workload
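One way to see why more replicas give the scheduler more choices: the chance that an idle node already holds a copy of a pending block grows with the replica factor. The simulation below is a simplified sketch that places replicas on distinct nodes uniformly at random. That is an assumption for illustration and not the actual HDFS placement policy.

```python
import random


def local_fraction(nodes: int, replicas: int, blocks: int = 10_000, seed: int = 0) -> float:
    """Estimate the chance that a random idle node holds a replica of a random block.

    Assumes uniform random placement on distinct nodes (not real HDFS placement).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(blocks):
        holders = rng.sample(range(nodes), replicas)  # nodes storing this block
        idle = rng.randrange(nodes)                   # node asking for work
        hits += idle in holders
    return hits / blocks


for r in range(1, 6):
    print(r, round(local_fraction(6, r), 2))  # approaches r / 6 on a 6-node cluster
```

Under this model, the probability of a data-local assignment is about replicas / nodes, so each added replica raises it by roughly one sixth on a 6-node cluster.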

GrepSearch(2)
- More nodes mean more resources and better performance
- When the workload size is small, the initial cost cannot be amortized and resources are already sufficient, so increasing the cluster size brings no obvious energy saving

GrepSearch(3)
- As the workload size increases, the value of EE decreases
- A small block size means a large overhead for job initialization
- A well-tuned block size can save as much as 36.8% of energy

TeraSort(1)
- Performance does not change significantly with varied replica factors
- More replicas would improve the load balance of map tasks
- But there would be large data transfers in the shuffle stage, affecting the progress of the whole job

TeraSort(2)
- With varied cluster size and workload size
- A larger workload increases the job runtime, and more IT components lead to more energy waste
- Adding more servers provides higher I/O throughput and reduces energy consumption by 20.2%

TeraSort(3)
- Fig. 3.c shows the value of EE with varied cluster size and workload size
- A small block size means fine-grained input splits, which improve the performance of the reduce sort
- A big block size means less data shuffle

WordCount(1)
- It uses almost 95% of the available CPU
- Increasing the replica factor improves performance and parallelism
- A large replica factor implies more opportunities for load balancing

WordCount(2)
- With more nodes added to the cluster, response time and energy consumption decrease by 56.8% and 18.2%, respectively, with the 20 GB data set

WordCount(3)
- A small block size incurs a high cost for task initialization
- A large block size such as 1 GB has a negative effect on the parallelism of the system
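Both effects can be seen by counting map tasks. The sketch below assumes one map task per HDFS block (split size equal to block size) and two map slots per node, which is the Hadoop 0.20 default for `mapred.tasktracker.map.tasks.maximum`. The slides do not state the slot count used in the experiments, so treat it as an assumption.

```python
import math


def map_tasks(data_gb: float, block_mb: int) -> int:
    """Number of map tasks, assuming one input split per HDFS block."""
    return math.ceil(data_gb * 1024 / block_mb)


def waves(tasks: int, nodes: int, slots_per_node: int = 2) -> int:
    """Rounds of map tasks needed to drain the job (slots_per_node is an assumption)."""
    return math.ceil(tasks / (nodes * slots_per_node))


for block_mb in (16, 64, 1024):
    tasks = map_tasks(20, block_mb)
    print(f"{block_mb:>5} MB blocks: {tasks:>5} tasks, {waves(tasks, nodes=6):>4} waves")
# 16 MB   -> 1280 tasks, 107 waves (per-task startup cost is paid 1280 times)
# 64 MB   ->  320 tasks,  27 waves
# 1024 MB ->   20 tasks,   2 waves (second wave fills only 8 of 12 slots)
```

With 1 GB blocks, a 5 GB input yields only 5 map tasks, so most of the 12 assumed slots on a 6-node cluster would sit idle.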

Key Findings
- With well-tuned system parameters and adaptive resource configurations, a MapReduce cluster can achieve both performance improvement and good energy saving simultaneously in some instances
- This is in surprising contrast to previous work on cluster-level energy conservation

Recommendations
- For CPU-intensive workloads with a high map/reduce ratio, an appropriate number of servers should be provided based on the workload size, ensuring load balance and adequate CPU resources
- For I/O-intensive workloads, fine-grained input splits are effective for the shuffle and reduce stages and are more energy efficient, provided that the initialization cost can be amortized
- Improved data partitioning algorithms in the map stage and content-aware reduce task scheduling strategies are key areas for energy efficiency where refinements and improvements are needed

Conclusion
- We identified four factors that affect the energy efficiency of MapReduce at the cluster level
- We chose four typical MapReduce workloads and measured their energy consumption at different cluster scales and with related factors varied
- A MapReduce cluster can achieve both significant energy saving and performance improvement simultaneously in some instances

Future Work
- Verify the results of this paper on larger clusters
- Introduce more MapReduce benchmarks
- Investigate the effects of changing other parameters: parallel reduce copies, memory limit, and file buffer size

Thank you Q&A
