Big Data Analytics with Parallel Jobs

Big Data Analytics with Parallel Jobs Ganesh Ananthanarayanan University of California at Berkeley http://www.cs.berkeley.edu/~ganesha/

Rising philosophy of data-ism: diagnostics and decisions backed by extensive data analytics. "In God we trust. Everybody else bring data to the table." Competitive and social benefits. Dichotomy: ever-increasing data sizes and ever-decreasing latency targets.

Parallelization Massive parallelization on large clusters; data is growing faster than Moore's law. Computation frameworks (e.g., Dryad) spread each job across many machines.

Computation Frameworks Job → DAG of tasks; outputs of upstream tasks are passed downstream. A job's input (file) is divided among tasks: every task reads a block of data. Computation and storage are co-located, so tasks are scheduled for locality (see the sketch below).
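A minimal sketch of locality-aware placement (hypothetical helper, not any framework's real scheduler): prefer a machine that already stores the task's input block, and fall back to any free slot.

def schedule(pending_tasks, free_machines, block_locations):
    """pending_tasks: task -> input block id
    free_machines: set of machines with a free slot
    block_locations: block id -> set of machines storing that block"""
    assignments = {}
    for task, block in pending_tasks.items():
        local = block_locations.get(block, set()) & free_machines
        if local:
            machine = local.pop()                # data-local placement
        elif free_machines:
            machine = next(iter(free_machines))  # fall back to a remote read
        else:
            break                                # no free slots; tasks wait
        free_machines.discard(machine)
        assignments[task] = machine
    return assignments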

Computation Frameworks [Figure: three tasks placed on slots across machines, each task reading one block of the input file]

Promise of Big Data Analytics Goal: efficient execution of jobs, measured by completion time and utilization. Challenge: scale (data size and parallelism) and performance variations → stragglers. In short: efficient and fault-tolerant execution of parallel jobs on large clusters.

IO intensive → Memory Tasks are IO intensive: a task completes faster if its input is read from memory instead of disk (a memory-local task). Memory prices are falling: 64GB/machine at Facebook in Aug 2011; 256GB/machine is not uncommon now.

Can we move all data to memory? There are proposals for making "RAM the new disk" (e.g., RAMCloud), but there is a huge discrepancy between storage and memory capacities: roughly 200x more data on disks than available memory. Instead, use memory as a cache.

Will the inputs fit in cache? Production traces from Dryad and Hadoop clusters: thousands of machines, over 1 million jobs in all, spanning 6 months (2011-12).

Will the inputs fit in cache? Heavy-tailed distribution of job sizes → 92% of jobs can fit their inputs in memory.

We built a memory cache… A simple in-memory distributed cache: cache the input data of jobs and schedule tasks for memory locality. Simple cache replacement policies: Least Recently Used (LRU) and Least Frequently Used (LFU); a sketch follows.
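A minimal sketch of such a block-level cache with the two baseline policies (hypothetical structure, not the actual prototype):

from collections import Counter, OrderedDict

class BlockCache:
    def __init__(self, capacity_blocks, policy="LFU"):
        self.capacity = capacity_blocks
        self.policy = policy
        self.blocks = OrderedDict()  # block id -> data; order tracks recency
        self.freq = Counter()        # block id -> access count (kept past eviction)

    def get(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # refresh recency for LRU
            self.freq[block_id] += 1
            return self.blocks[block_id]       # memory-local read
        return None                            # miss: caller reads from disk

    def put(self, block_id, data):
        while len(self.blocks) >= self.capacity:
            if self.policy == "LRU":
                self.blocks.popitem(last=False)                        # evict oldest
            else:
                victim = min(self.blocks, key=lambda b: self.freq[b])  # evict least used
                del self.blocks[victim]
        self.blocks[block_id] = data
        self.freq[block_id] += 1

Note that both policies evict block by block, with no notion of which job or file a block belongs to; the next slide shows why that barely helps parallel jobs.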

We built a memory cache… Replayed the Facebook trace of Hadoop jobs: jobs sped up by only 10% with LFU despite a hit-ratio of 67%, and even Belady's MIN (optimal) gives just a 13% improvement with a hit-ratio of 74%. How do we make caching really speed up parallel jobs?

Parallel jobs require a new class of cache replacement algorithms

Parallel Jobs Tasks of small jobs run simultaneously in a single wave. [Figure: two-slot timelines comparing task durations with uncached vs. cached input; the job's completion time drops only when the inputs of both tasks are cached] All-or-Nothing: unless all inputs are cached, there is no benefit.

All-or-Nothing for multi-waved jobs Large jobs run tasks in multiple waves: the number of tasks is larger than the number of slots. Wave-width: the number of parallel tasks of a job. [Figure: five-slot timeline of a job executing in successive waves; caching complete waves shortens completion time] Cache at the wave-width granularity.

Cache at the wave-width granularity Example: a job with 50 tasks and a wave-width of 10 runs its tasks 10 at a time; all-or-nothing applies at the granularity of each wave.

How to evict from cache? View at the granularity of a job's input (file) and evict from incompletely cached waves: the Sticky Policy. [Figure: two jobs of eight tasks each at a hit-ratio of 50%. Without the sticky policy the cached blocks are spread across both jobs and neither speeds up; with the sticky policy the hits are focused on Job 1, which completes faster.]

Which file should be evicted? It depends on the metric to optimize. User-centric metric: completion time of jobs. Operator-centric metric: utilization of the cluster. What are the eviction policies for these metrics?

Reduction in Completion Time Idealized model for a job: wave-width W; frequency F (past access frequency predicts future accesses); task duration proportional to the data read, D; speedup factor for cached tasks, µ. Completion time of the job ∝ D. Cost of caching: W · D. Benefit of caching: µ · D · F. Benefit/cost: µF/W → rank files by frequency/wave-width.
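The benefit/cost argument written out (a reconstruction from the definitions above):

\[
\text{cost} = W D, \qquad
\text{benefit} = \mu D F, \qquad
\frac{\text{benefit}}{\text{cost}} = \frac{\mu D F}{W D} = \frac{\mu F}{W}.
\]

Since µ is the same constant for every file, ranking by µF/W is the same as ranking by frequency/wave-width.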

How to estimate W for a job? [Figure: scatter plot of wave-width (slots) vs. job size, giving a "conservative upper bound"] Use the size of a file as a proxy for its wave-width; the relative ordering of files remains unchanged.

Improvement in Utilization Same idealized model: wave-width W, frequency F, task duration D, speedup factor µ. For utilization, the benefit counts the machine-time saved across all W parallel tasks. Cost of caching: W · D. Benefit of caching: W · µ · D · F. Benefit/cost: µF → rank files by frequency.
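The corresponding ratio (same reconstruction style; the extra W reflects machine-time saved on all W slots):

\[
\frac{\text{benefit}}{\text{cost}} = \frac{W \mu D F}{W D} = \mu F,
\]

so for utilization the ranking reduces to frequency alone.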

Isn't this just Least Frequently Used? The All-or-Nothing property matters for utilization too, because inter-dependent tasks overlap: reduce tasks start before all map tasks finish (to overlap communication), so a straggling uncached map task wastes the waiting reducers' time. All-or-Nothing → no wastage!

Cache Eviction Policies For completion time: LIFE (Largest Incomplete File to Evict) evicts from the file with the lowest frequency/wave-width, and it is sticky: fully evict one file before touching the next (all-or-nothing). For utilization: LFU-F (Least Frequently Used, File granularity) evicts from the file with the lowest frequency. A sketch of both follows.
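A minimal sketch of the two policies under the sticky rule (hypothetical bookkeeping, not the actual implementation):

def choose_victim(files, policy):
    # files: file id -> {"freq": F, "wave_width": W, "size": bytes}
    if policy == "LIFE":    # optimize completion time
        return min(files, key=lambda f: files[f]["freq"] / files[f]["wave_width"])
    if policy == "LFU-F":   # optimize cluster utilization
        return min(files, key=lambda f: files[f]["freq"])
    raise ValueError(policy)

def make_room(files, cache_used, capacity, incoming_size, policy="LIFE"):
    """Evict whole files (sticky) until the incoming input fits."""
    while cache_used + incoming_size > capacity and files:
        victim = choose_victim(files, policy)
        cache_used -= files.pop(victim)["size"]  # drop the entire file at once
    return cache_used

Evicting a whole file at a time is what preserves the all-or-nothing property for the files that stay cached.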

How do we achieve the sticky policy? Caches are distributed: blocks of files are stored across machines, so how do we know which files are incompletely cached? Coordination. A global view of all the caches determines which blocks to evict (sticky policy) and where to schedule tasks (memory locality).

PACMan: Centralized Coordination Global view of all the distributed caches. DFS: Distributed File System.

Evaluation Setup Workload derived from Facebook & Bing traces Prototype in conjunction with HDFS Experiments on 100-node EC2 cluster Cache of 20GB per machine

Reduction in Completion Time

Replacement Policy       | Reduction in average completion time (%) | Hit-ratio (%)
MIN (idealized)          | 13                                       | 59
LRU                      |  9                                       | 33
LFU                      | 10                                       | 48
LIFE (sticky)            | 53                                       | 45

The sticky policy is what lets LIFE speed up jobs far more than even the idealized MIN, despite a lower hit-ratio.

Improvement in Utilization

Replacement Policy       | Improvement in utilization (%)
LRU                      | 13
LFU                      | 46
MIN (idealized)          | 51
LFU-F (sticky)           | 54

PACMan: Scalability PACMan Coordinator: near-uniform latency of ≤ 1.2ms with up to 11,000 clients/s. PACMan Client: scales up to 20 simultaneous tasks on a machine. The coordination infrastructure is sufficiently scalable.

How far is LIFE from the optimal for speeding up parallel jobs?

Caches for parallel jobs are complex! Maximize the number of "job hits" (a la Belady's MIN), where job hit = 1 if all of a job's inputs are cached, and 0 otherwise. Complications: cache space can be viewed as aggregated vs. distributed; input blocks can partially overlap between jobs; inputs can be equal-sized or of varying wave-width.

Simplifying Assumptions Cache space, aggregated vs. distributed: at ~100% locality, aggregated ≈ distributed, so assume an aggregated cache space. Partial overlap of input blocks between jobs: 98% of jobs have no partial overlap, so assume no overlap. Inputs: equal-sized vs. varying wave-width is handled next.

Notation Jobs are indexed by 1, 2, …, n. Job i has wave-width w_i and frequency f_i. Total cache space = C. Single-waved jobs.

Problem Definition Job i has wave-width w_i and frequency f_i; total cache space = C; x_i indicates whether the input of job i is in cache. Maximize the number of "job hits": maximize Σ_i f_i x_i subject to Σ_i w_i x_i ≤ C, with x_i ∈ {0, 1}. The solution to this 0-1 knapsack is the optimal cache configuration (see the sketch below).
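A sketch of the greedy "truncation" of this knapsack (single-waved jobs, aggregated cache, no overlap, per the assumptions above): sort jobs by frequency/wave-width, pack greedily, and discard the first job that no longer fits whole, i.e., the fractional item of the linear relaxation.

def truncated_lp_cache(jobs, C):
    """jobs: list of (f_i, w_i) pairs; C: total cache space.
    Returns the indices of the jobs whose inputs should be cached."""
    order = sorted(range(len(jobs)),
                   key=lambda i: jobs[i][0] / jobs[i][1],  # f_i / w_i
                   reverse=True)
    chosen, used = [], 0
    for i in order:
        w = jobs[i][1]
        if used + w > C:
            break             # discard the fractional item (truncation)
        chosen.append(i)
        used += w
    return chosen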

Truncation Theorem: favoring frequency/wave-width is (1 - 1/K) optimal, where K = C/(max_i w_i). Comparing "job hits" (Σ_i f_i x_i): the linear relaxation (0 ≤ x_i ≤ 1) is solved by ranking on frequency/wave-width and upper-bounds the 0-1 knapsack (x_i ∈ {0, 1}); truncation discards the fractional item from the linear relaxation*. (*The discarded input is the one with the largest wave-width.)

LIFE + Optional Eviction Theorem: LIFE + Optional Eviction is (1 - 1/K) optimal, where K = C/(max_i w_i). Optional eviction: the cache may evict (i.e., decline to admit) the input of the incoming job itself, unlike traditional caches, which always admit the newest item.

LIFE + Optional Eviction

Replacement Policy       | Reduction in average completion time (%)
LIFE                     | 53
LIFE + Optional Eviction | 61
Optimal                  | 63

Most jobs are small (heavy-tailed distribution), so K → ∞ and (1 - 1/K) → 1: LIFE + Optional Eviction approaches the optimal. Although optional eviction goes against classical cache design, it is a practical improvement to LIFE.

PACMan Highlights All-or-nothing property of parallel jobs: cache at the granularity of wave-widths. PACMan: a coordinated in-memory caching system. Cache replacement policies target completion time (LIFE) and utilization (LFU-F), not hit-ratio. Optimality analysis shows they are close approximations of the optimal.

Straggler Mitigation Stragglers: slow tasks that delay job completion. All-or-Nothing → a job is as slow as its slowest task. Mantri (for large production jobs): the first work to present insights from the wild; a cause-aware solution, in deployment at Bing. Dolly (for small interactive jobs): clone jobs upfront and pick the first copy to finish; heavy-tailed job sizes → nearly all stragglers mitigated with just 3% extra resources.
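A back-of-the-envelope for why upfront cloning works (illustrative assumption, not a number from the talk: each clone of a task straggles independently with probability p):

\[
\Pr[\text{all } c \text{ clones straggle}] = p^{c},
\qquad \text{e.g. } p = 0.1,\ c = 3 \ \Rightarrow\ p^{c} = 0.001 .
\]

And because small jobs dominate in count while using a tiny fraction of the cluster's resources, cloning only them is cheap.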

Promise of Big Data Analytics Efficient and fault-tolerant execution of parallel jobs. Caching & Locality: "PACMan: Coordinated Memory Caching for Parallel Jobs", NSDI'12; "Scarlett: Coping with Skewed Popularity Content in MapReduce Clusters", EuroSys'11; "True Elasticity in Multi-Tenant Data-Intensive Compute Clusters", SoCC'12; "Disk-Locality in Datacenter Computing Considered Irrelevant", HotOS'11. Straggler Mitigation: "Reining in Outliers in MapReduce Clusters using Mantri", OSDI'10; "Effective Straggler Mitigation: Attack of the Clones", NSDI'13; "Why let resources idle? Aggressive Cloning of Jobs with Dolly", HotCloud'12.

Conclusion All-or-Nothing property of parallel jobs. PACMan: coordinated cache management with provably near-optimal cache replacement. Straggler mitigation: a cause-aware solution for large jobs, cloning for small jobs.