Self-Adaptive, Energy-Conserving variant of Hadoop Distributed File System Kumar Sharshembiev.

1 Self-Adaptive, Energy-Conserving variant of Hadoop Distributed File System Kumar Sharshembiev

2 1. Current energy issues with HDFS and large server farms 2. Past approaches and solutions for energy conservation and cost cutting 3. GreenHDFS's unique design and solution 4. Conclusions and references

3 • HDFS was designed as a scalable file system that runs on a large number of commodity servers – currently ~155,500 at Yahoo

4 • A large number of servers generates heat and consumes energy in very large quantities • Over the lifetime of a server, the operating energy cost is comparable to the initial acquisition cost, and ownership costs (power, cooling, etc.) keep growing • Much effort and research has gone into energy-conservation solutions for extremely large-scale server farms

5 • One commonly used approach is “scale-down”: transitioning servers into a low-power state • Example: many datacenters transfer workloads and their state to a smaller number of servers during low-activity hours • Problem? This approach works only when servers are stateless, i.e. they get all of their data from NAS/SAN

6 • “Scale-down” approaches work only with NAS/SAN: since all of the data is stored on dedicated storage devices, the workload can be migrated to fewer servers

7 • Hadoop distributes all of its files among many servers – any of the thousands of nodes can be participating at any moment

8 • Self-adaptive – depends only on HDFS and file access patterns • Applies data-classification techniques • Does energy-aware placement of data • Trades off cost, performance, and power by separating the cluster into logical zones

9 • The team did a detailed analysis of files in a production Yahoo! Hadoop cluster: • Files are heterogeneous in access and lifespan patterns – some are rarely accessed, some are deleted shortly after creation, some stay a while • 60% of data is “cold” or dormant, meaning it sits without being accessed – “need to exist for history files”

10 • 95-98% of files had a very short “hotness” lifespan of less than 3 days, meaning they were actively used only during the first 3 days • 90% of files in the top-level directory were dormant or “cold” for more than 18 days • The majority of the data had a news-server-like access pattern, where most of the computation happens soon after the data is created

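11 • GreenHDFS organizes servers into logical Hot and Cold Zones using different policies: the File Migration Policy (FMP), the Server Power Conserving Policy (SCP), and the File Reversal Policy (FRP) [Diagram labels: FMP; Performance, Cost and Power]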

12 • The goal of GreenHDFS is to maximize the number of servers in the Hot Zone and minimize the number in the Cold Zone • Servers in the Cold Zone are storage-heavy • GreenHDFS relies heavily on the “temperature” of files – the higher the dormancy (the more rarely a file is accessed), the lower the temperature, and vice versa • Dormancy is determined simply by recording the last access time whenever a file is read
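The relationship between dormancy and temperature can be sketched in a few lines. This is only an illustration: the slides say that temperature falls as dormancy rises, but give no formula, so the `temperature` function below (an inverse of dormancy in days) and all names are assumptions.

```python
from datetime import datetime, timedelta


def dormancy_days(last_access: datetime, now: datetime) -> float:
    """Days elapsed since the file was last read (its dormancy)."""
    return (now - last_access).total_seconds() / 86400.0


def temperature(last_access: datetime, now: datetime) -> float:
    """Hypothetical score in (0, 1]: 1.0 for a file read just now,
    falling toward 0 as dormancy grows. Not the GreenHDFS formula."""
    return 1.0 / (1.0 + dormancy_days(last_access, now))


now = datetime(2011, 1, 31)
recent = temperature(now - timedelta(days=1), now)   # read yesterday
stale = temperature(now - timedelta(days=30), now)   # untouched for a month
print(recent > stale)  # the more dormant file is colder
```

Any monotonically decreasing function of dormancy would serve equally well here; what matters is that the last-access timestamp is the only input.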

13 • FMP monitors the dormancy of the files and runs in the Hot Zone • This gives higher storage efficiency for the Hot Zone, as less-accessed files are moved to the Cold Zone • It also gives significant energy conservation [Diagram: Hot Zone (heavy computations), FMP, Cold Zone (idle servers); transitions labeled Coldness > Threshold and Hotness > Threshold]
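A minimal sketch of an FMP-style pass, assuming each file carries a dormancy value in days and a zone label. The `FileInfo` record, the `run_fmp` function, and the 18-day threshold are hypothetical and chosen only for illustration; the slides do not state the actual threshold or data structures.

```python
from dataclasses import dataclass


@dataclass
class FileInfo:
    path: str
    dormancy_days: float
    zone: str = "hot"


def run_fmp(files, coldness_threshold_days):
    """Move Hot Zone files whose dormancy exceeds the threshold
    to the Cold Zone and return the paths that were moved."""
    moved = []
    for f in files:
        if f.zone == "hot" and f.dormancy_days > coldness_threshold_days:
            f.zone = "cold"
            moved.append(f.path)
    return moved


files = [
    FileInfo("/data/a", 2.0),                # still hot
    FileInfo("/data/b", 25.0),               # dormant, should migrate
    FileInfo("/data/c", 40.0, zone="cold"),  # already in the Cold Zone
]
moved = run_fmp(files, coldness_threshold_days=18.0)
print(moved)
```

Only `/data/b` is migrated: `/data/a` is below the threshold and `/data/c` is already cold.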

14 • SCP runs in the Cold Zone and determines which servers can go into standby/sleep mode • SCP uses hardware techniques to put the CPU, disks, and DRAM into a low-power state • SCP wakes a server up only if: ◦ Data on that server is accessed ◦ New data needs to be placed on that server
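The two wake-up conditions can be illustrated with a toy state machine. The slides do not say how SCP decides that a server may sleep, so the idle-time threshold below is an assumption, as are the `ColdServer` class and its method names; no real power-management API is involved.

```python
from enum import Enum


class PowerState(Enum):
    ACTIVE = "active"
    SLEEPING = "sleeping"


class ColdServer:
    def __init__(self, name):
        self.name = name
        self.state = PowerState.ACTIVE
        self.idle_hours = 0.0
        self.wakeups = 0

    def tick(self, hours, idle_threshold_hours):
        """Advance time with no activity; sleep once idle long enough
        (the idle threshold is an assumed sleep criterion)."""
        self.idle_hours += hours
        if (self.state is PowerState.ACTIVE
                and self.idle_hours >= idle_threshold_hours):
            self.state = PowerState.SLEEPING

    def _wake(self):
        if self.state is PowerState.SLEEPING:
            self.state = PowerState.ACTIVE
            self.wakeups += 1
        self.idle_hours = 0.0

    def read(self, path):
        self._wake()  # wake condition 1: data on this server is accessed

    def place(self, path):
        self._wake()  # wake condition 2: new data must be placed here


server = ColdServer("cold-01")
server.tick(hours=6, idle_threshold_hours=4)
print(server.state)   # sleeping after six idle hours
server.read("/archive/2010/log")
print(server.state, server.wakeups)
```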

15 • FRP runs in the Cold Zone and ensures that QoS, bandwidth, and response time are managed well if files become “popular” • If the number of accesses to a file exceeds the threshold, the file's metadata is changed and the file is “moved” to the Hot Zone • The threshold values of FMP, SCP, and FRP should all be chosen to maximize energy efficiency
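The reversal rule amounts to an access counter with a threshold. The sketch below assumes a simple in-memory map from path to zone; the class name `FileReversalPolicy`, its methods, and the threshold of 2 are hypothetical and not taken from GreenHDFS itself.

```python
from collections import Counter


class FileReversalPolicy:
    def __init__(self, access_threshold):
        self.access_threshold = access_threshold
        self.access_counts = Counter()
        self.zone = {}  # path -> "hot" or "cold"

    def add_cold_file(self, path):
        self.zone[path] = "cold"
        self.access_counts[path] = 0

    def record_access(self, path):
        """Count an access to a cold file and flip its zone label once
        the count exceeds the threshold. Returns the file's zone."""
        if self.zone.get(path) == "cold":
            self.access_counts[path] += 1
            if self.access_counts[path] > self.access_threshold:
                self.zone[path] = "hot"  # metadata change, i.e. the "move"
        return self.zone.get(path)


frp = FileReversalPolicy(access_threshold=2)
frp.add_cold_file("/logs/2010-06-01")
zones = [frp.record_access("/logs/2010-06-01") for _ in range(3)]
print(zones)
```

With a threshold of 2, the first two accesses leave the file cold and the third one reverses it to the Hot Zone.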

16 • A file goes through several stages in its lifetime: ◦ File creation – just created ◦ Hot period – frequently used ◦ Dormant period – not accessed ◦ Deletion • GreenHDFS introduced various lifespan metrics and analyzed lifespan distributions to determine optimal threshold values for its policies ◦ FileLifeSpanCFR – file creation to first read ◦ FileLifeSpanCLR – file creation to last read ◦ FileLifeSpanLRD – last read to deletion ◦ FileLifeSpanFLR – first read to last read ◦ FileLifeTime – creation to deletion
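Each of these metrics is a difference between two timestamps in a file's history. A minimal sketch, assuming a hypothetical `FileHistory` record with four timestamps; the metric names follow the slide, everything else is illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class FileHistory:
    created: datetime
    first_read: Optional[datetime]
    last_read: Optional[datetime]
    deleted: Optional[datetime]


def _span(start, end):
    """Interval between two events, or None if either never happened."""
    return None if start is None or end is None else end - start


def lifespan_metrics(h: FileHistory) -> dict:
    return {
        "FileLifeSpanCFR": _span(h.created, h.first_read),
        "FileLifeSpanCLR": _span(h.created, h.last_read),
        "FileLifeSpanLRD": _span(h.last_read, h.deleted),
        "FileLifeSpanFLR": _span(h.first_read, h.last_read),
        "FileLifeTime": _span(h.created, h.deleted),
    }


created = datetime(2010, 6, 1)
history = FileHistory(
    created=created,
    first_read=created + timedelta(days=1),
    last_read=created + timedelta(days=3),
    deleted=created + timedelta(days=30),
)
metrics = lifespan_metrics(history)
for name, value in metrics.items():
    print(name, value)
```

In this example the file is hot for two days (FileLifeSpanFLR) and then dormant for 27 days before deletion (FileLifeSpanLRD), which is the kind of pattern the slides describe for most files.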


18 • The majority of files have a short hotness lifespan

19 • 80% of files in d have a dormancy period > 20 days

20 • Simulation to test energy conservation

21 • 24% reduction in energy consumption: ~$2.1 million saved for 38,000 servers, or ~$8.5 million saved on 155K servers today
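The slide does not show how the larger figure was obtained. One plausible reading, assumed here, is a linear scale-up of the 38,000-server savings to 155K servers, i.e. the same savings per server. The function name is hypothetical.

```python
def scale_savings(savings: float, servers: int, target_servers: int) -> float:
    """Linearly scale a savings figure by server count
    (assumes identical savings per server)."""
    return savings * target_servers / servers


estimate = scale_savings(2.1e6, 38_000, 155_000)
print(f"${estimate / 1e6:.2f} million")  # prints "$8.57 million"
```

Under that assumption the estimate is roughly $8.57 million, consistent with the ~$8.5 million quoted on the slide.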

22 • More servers and space available = better performance

23 • GreenHDFS is a policy-driven, self-adaptive variant of HDFS • It relies on data-classification-driven data placement, which yields significant periods of idleness on a subset of servers • It categorizes files into 2 zones: Hot and Cold • It applies sets of policies to classify files as Hot or Cold

24 • Energy consumption was reduced by 24%, saving $2.1 million for 38,000 servers at that time; today the savings could be more than $8.5 million • Storage efficiency also increased, since dormant files are moved to the Cold Zone • More space and better utilization of the Hot Zone lead to better performance for HDFS/MapReduce

25 • resentations/papers/kaushik.pdf
