Big Data Analytics with Parallel Jobs


1 Big Data Analytics with Parallel Jobs
Ganesh Ananthanarayanan, University of California at Berkeley

2 Rising philosophy of data-ism
Diagnostics and decisions backed by extensive data analytics: "In God we trust. Everybody else, bring data to the table." Competitive and social benefits. The dichotomy: ever-increasing data sizes and ever-decreasing latency targets.

3 Parallelization
Massive parallelization on large clusters. Data is growing faster than Moore's law. Computation frameworks such as Dryad.

4 Computation Frameworks
Job → DAG of tasks; outputs of upstream tasks are passed downstream. The job's input (a file) is divided among tasks, and every task reads one block of data. Computation and storage are co-located: tasks are scheduled for locality.
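As a rough sketch of that model (hypothetical names, not any framework's actual API), a job can be viewed as a DAG of tasks, each reading one block, with locality-preferring placement:

```python
from dataclasses import dataclass, field

# Illustrative sketch of the job model, not the Dryad/Hadoop API.
@dataclass
class Task:
    block: str                                       # input block this task reads
    downstream: list = field(default_factory=list)   # tasks consuming our output

def place(task, block_locations, free_machines):
    """Prefer a free machine that already stores the task's input block."""
    local = [m for m in block_locations[task.block] if m in free_machines]
    return local[0] if local else next(iter(free_machines))
```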

5 Computation Frameworks
[Figure: tasks (Task1, Task2, Task3) scheduled onto slots, each reading a local block.]

6 Promise of Big Data Analytics
Goal: efficient execution of jobs, measured by completion time and utilization. Challenge: scale (data size and parallelism) and performance variations → stragglers. In short: efficient and fault-tolerant execution of parallel jobs on large clusters.

7 IO-intensive → Memory
Tasks are IO-intensive: a task completes faster if its input is read from memory instead of disk (a "memory-local" task). Memory prices are falling: 64 GB/machine at Facebook in August 2011; 256 GB/machine is not uncommon now.

8 Can we move all data to memory?
Proposals to make "RAM the new disk" (e.g., RAMCloud). But there is a huge discrepancy between storage and memory capacities: roughly 200x more data on disk than available memory. Instead, use memory as a cache.

9 Will the inputs fit in cache?
Production traces from Dryad and Hadoop clusters: thousands of machines, over 1 million jobs in all, spanning 6 months.

10 Will the inputs fit in cache?
Heavy-tailed distribution of job sizes → 92% of jobs can fit their inputs in memory.

11 We built a memory cache…
A simple in-memory distributed cache: cache the input data of jobs and schedule tasks for memory locality. Simple cache replacement policies: Least Recently Used (LRU) and Least Frequently Used (LFU).
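A minimal sketch of such a cache with LFU replacement (illustrative, not the prototype's code):

```python
from collections import Counter

class LFUCache:
    """Block-level cache: evict the least frequently accessed block."""
    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = {}         # block id -> data
        self.freq = Counter()    # block id -> access count

    def get(self, block):
        if block in self.blocks:
            self.freq[block] += 1
            return self.blocks[block]
        return None              # miss: caller reads the block from disk

    def put(self, block, data):
        if len(self.blocks) >= self.capacity:
            victim = min(self.blocks, key=self.freq.__getitem__)
            del self.blocks[victim]
        self.blocks[block] = data
        self.freq[block] += 1
```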

12 We built a memory cache…
We replayed the Facebook trace of Hadoop jobs: jobs sped up by only 10% with LFU, at a hit-ratio of 67%. Belady's MIN (the optimal) gave a 13% improvement at a hit-ratio of 74%. High hit-ratios are not translating into job speedups: how do we make caching really speed up parallel jobs?

13 Parallel jobs require a new class of cache replacement algorithms

14 Parallel Jobs Tasks of small jobs run simultaneously in a wave
[Figure: two-slot timelines showing task durations with cached vs. uncached inputs.] All-or-Nothing: unless all inputs are cached, there is no benefit.
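A two-line toy model makes the property concrete (illustrative durations; a wave finishes when its slowest task does):

```python
def wave_completion(cached_inputs, total_tasks,
                    t_cached=1.0, t_uncached=10.0):
    """A wave finishes when its slowest task finishes."""
    durations = [t_cached] * cached_inputs + \
                [t_uncached] * (total_tasks - cached_inputs)
    return max(durations)

print(wave_completion(9, 10))    # 10.0 -> 9 of 10 inputs cached, no speedup
print(wave_completion(10, 10))   # 1.0  -> all inputs cached, full speedup
```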

15 All-or-Nothing for multi-waved jobs
Large jobs run tasks in multiple waves: the number of tasks is larger than the number of slots. Wave-width: the number of parallel tasks of a job. [Figure: five-slot timeline with tasks executing in successive waves.]

18 All-or-Nothing for multi-waved jobs
Large jobs run tasks in multiple waves; the wave-width is the number of parallel tasks. Implication: cache at the wave-width granularity. [Figure: the same five-slot timeline, with cached inputs grouped into complete waves.]

19 Cache at the wave-width granularity
A job with 50 tasks and a wave-width of 10; all-or-nothing now applies at the wave granularity.
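A toy calculation under the stated setup (illustrative durations, and assuming the scheduler groups cached inputs into complete waves) shows that benefits arrive only in wave-width increments:

```python
import math

def job_time(cached, tasks=50, wave_width=10,
             t_cached=1.0, t_uncached=10.0):
    """Tasks run in waves of `wave_width`; a wave is fast only if
    every one of its inputs is cached (all-or-nothing)."""
    waves = math.ceil(tasks / wave_width)
    fast_waves = cached // wave_width        # only whole waves benefit
    return fast_waves * t_cached + (waves - fast_waves) * t_uncached

for cached in (0, 5, 10, 15, 20):
    print(cached, job_time(cached))          # drops only at multiples of 10
```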

20 How to evict from cache?
View the cache at the granularity of a job's input (file) and evict from incompletely cached waves: the Sticky Policy. [Figure: two jobs, each at a 50% hit-ratio. Without the sticky policy the cached blocks are spread across both jobs and neither speeds up; with it, the benefit is focused on Job 1, which speeds up.]
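A sketch of the sticky idea (illustrative, not PACMan's actual code): prefer evicting blocks of files that are already incompletely cached, so fully cached files stay intact:

```python
def pick_victim_block(cached, total_blocks):
    """cached: file -> set of cached block ids.
       total_blocks: file -> number of blocks in the file.
    Evict from a file that is already incompletely cached, so fully
    cached files (which give whole-job speedups) are left alone."""
    incomplete = [f for f in cached if len(cached[f]) < total_blocks[f]]
    pool = incomplete if incomplete else list(cached)
    victim_file = pool[0]        # a ranking policy plugs in here
    return victim_file, next(iter(cached[victim_file]))
```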

21 Which file should be evicted?
It depends on the metric being optimized. User-centric metric: completion time of jobs. Operator-centric metric: utilization of the cluster. What are the eviction policies for these metrics?

22 Reduction in Completion Time
Idealized model for a job: wave-width W; frequency F, which predicts future accesses; task duration D, proportional to the data read; speedup factor µ for cached tasks. Cost of caching: W·D. Benefit of caching: µ·D·F. Benefit/cost: µF/W. For completion time, rank files by frequency/wave-width.
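Written out with the symbols above:

```latex
\[
\underbrace{W \cdot D}_{\text{cost: cache space held}}
\qquad
\underbrace{\mu D \cdot F}_{\text{benefit: time saved over } F \text{ accesses}}
\qquad
\frac{\text{benefit}}{\text{cost}} = \frac{\mu D F}{W D} = \frac{\mu F}{W}
\]
```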

23 How to estimate W for a job?
[Figure: scatter plot of wave-width (slots) vs. job size; file size is a conservative upper bound.] Use the size of a file as a proxy for its wave-width: the relative ordering of files remains unchanged.

24 Improvement in Utilization
The same idealized model (wave-width W, frequency F, task duration D, speedup factor µ). Cost of caching: W·D. Benefit of caching: µ·D·F for each of the W parallel tasks. Benefit/cost: µF. For utilization, rank files by frequency.

25 Isn’t this just Least Frequently Used?
It is not: the all-or-nothing property matters for utilization too. Inter-dependent tasks overlap: reduce tasks start before all map tasks finish (to overlap communication), so they idle while uncached map tasks lag. All-or-nothing → no wastage!

26 Cache Eviction Policies
Completion-time policy: LIFE (Largest Incomplete File to Evict). Evict from the file with the lowest frequency/wave-width; sticky: fully evict one file before touching the next (all-or-nothing). Utilization policy: LFU-F (Least Frequently Used, at File granularity). Evict from the file with the lowest frequency.
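Both policies can be sketched as rankings over files (illustrative fields, not PACMan's code), with the sticky preference layered on top:

```python
from dataclasses import dataclass

@dataclass
class CachedFile:
    frequency: float     # predicted future accesses
    wave_width: int      # parallel tasks that read the file
    cached_blocks: int
    total_blocks: int

def life_victim(files):
    """LIFE: among incompletely cached files first (sticky), evict from
    the file with the lowest frequency/wave-width."""
    pool = [f for f in files if f.cached_blocks < f.total_blocks] or files
    return min(pool, key=lambda f: f.frequency / f.wave_width)

def lfu_f_victim(files):
    """LFU-F: evict from the file with the lowest access frequency."""
    pool = [f for f in files if f.cached_blocks < f.total_blocks] or files
    return min(pool, key=lambda f: f.frequency)
```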

27 How do we achieve the sticky policy?
Caches are distributed: blocks of files are stored across machines, so how do we know which files are incomplete? Coordination. A global view of all the caches tells us which blocks to evict (the sticky policy) and where to schedule tasks (memory locality).

28 PACMan: Centralized Coordination
[Figure: a PACMan coordinator with a global view over per-machine caches, layered above the DFS (Distributed File System).]

29 Evaluation Setup Workload derived from Facebook & Bing traces
Prototype built in conjunction with HDFS. Experiments on a 100-node EC2 cluster with a 20 GB cache per machine.

30 Reduction in Completion Time
Replacement Policy     Reduction in avg. completion time (%)   Hit-ratio (%)
MIN (idealized)        13                                      59
LRU                    9                                       33
LFU                    10                                      48
LIFE (sticky policy)   53                                      45

31 Improvement in Utilization
Replacement Policy      Improvement in utilization (%)
LRU                     13
LFU                     46
MIN (idealized)         51
LFU-F (sticky policy)   54

32 PACMan: Scalability
PACMan coordinator: near-uniform latency of ≤ 1.2 ms with up to 11,000 clients/s. PACMan client: scales up to 20 simultaneous tasks on a machine. The coordination infrastructure is sufficiently scalable.

33 How far is LIFE from the optimal for speeding up parallel jobs?

34 Caches for parallel jobs are complex!
Maximize the number of "job hits" (à la Belady's MIN), where a job hit is 1 if all of the job's inputs are cached and 0 otherwise. First complication: cache space, aggregated vs. distributed.

35 Caches for parallel jobs are complex!
Maximize the number of "job hits", as before. Second complication: partial overlap of input blocks between jobs (Job 1 and Job 2 may share blocks).

36 Caches for parallel jobs are complex!
Maximize the number of "job hits", as before. Third complication: inputs are not equal-sized; wave-widths vary across jobs. [Figure: jobs whose inputs span differing numbers of blocks.]

37 Simplifying Assumptions
Cache space, aggregated vs. distributed: at ~100% locality, aggregated ≈ distributed, so assume aggregated cache space. Partial overlap of input blocks between jobs: 98% of jobs have no partial overlap, so assume no overlap. What remains: inputs of varying wave-widths.

38 Notation Jobs indexed by 1, 2, …, n
Job i has wave-width w_i and frequency f_i; total cache space = C. Consider single-waved jobs.

39 Problem Definition Job i has wave-width wi and frequency fi
Total cache space = C; x_i indicates whether the input of job i is in cache. Maximize the number of "job hits":
    maximize Σ_i f_i x_i
    subject to Σ_i w_i x_i ≤ C, with x_i ∈ {0, 1}
The solution to this 0-1 knapsack is the optimal cache configuration.

40 Truncation
Theorem: favoring frequency/wave-width is (1 - 1/K)-optimal, where K = C/(max_i w_i). Proof sketch, comparing "job hits" Σ_i f_i x_i: relax the 0-1 knapsack (x_i ∈ {0, 1}) to its linear relaxation (0 ≤ x_i ≤ 1), for which packing greedily by frequency/wave-width is optimal; then truncate, discarding the fractional input, which is the one with the largest wave-width.

41 LIFE + Optional Eviction
Theorem: LIFE + Optional Eviction is (1 - 1/K)-optimal, where K = C/(max_i w_i). Optional eviction: the input of the incoming job may itself be evicted, i.e., declined admission (unlike in traditional caches).

42 LIFE + Optional Eviction
Replacement Policy         Reduction in avg. completion time (%)
LIFE                       53
LIFE + Optional Eviction   61
Optimal                    63

Most jobs are small (heavy-tailed distribution), so K → ∞ and (1 - 1/K) → 1: LIFE + Optional Eviction approaches the optimal. Optional eviction goes against classical caching, yet it is a practical improvement to LIFE.

43 PACMan Highlights
All-or-nothing property for parallel jobs: cache at the granularity of wave-widths. PACMan: a coordinated in-memory caching system. Cache replacement policies target completion time (LIFE) and utilization (LFU-F), not hit-ratio. Optimality analysis; close approximation in practice.

44 Straggler Mitigation Stragglers: slow tasks that delay job completion
All-or-nothing → a job is as slow as its slowest task. Mantri (for large production jobs): the first work to present insights from the wild; a cause-aware solution, in deployment at Bing. Dolly (for small interactive jobs): clone jobs upfront and pick the first to finish; heavy-tailed job sizes mean nearly all stragglers can be mitigated with just 3% extra resources.
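A back-of-the-envelope for why cloning works (illustrative numbers, not the Dolly evaluation): if each task straggles independently with probability p, running c clones per task and taking the first to finish drops the per-task straggler probability to p^c:

```python
def job_straggler_prob(p, tasks, clones):
    """P(some task of the job straggles) when each task runs `clones`
    copies and the first copy to finish wins."""
    per_task = p ** clones           # every copy must straggle
    return 1 - (1 - per_task) ** tasks

print(job_straggler_prob(0.05, 10, 1))   # ~0.40: stragglers are the norm
print(job_straggler_prob(0.05, 10, 3))   # ~0.001: nearly all mitigated
```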

45 Promise of Big Data Analytics
Efficient and fault-tolerant execution of parallel jobs.

Caching & Locality:
"PACMan: Coordinated Memory Caching for Parallel Jobs", NSDI '12
"Scarlett: Coping with Skewed Popularity Content in MapReduce Clusters", EuroSys '11
"True Elasticity in Multi-Tenant Data-Intensive Compute Clusters", SoCC '12
"Disk-Locality in Datacenter Computing Considered Irrelevant", HotOS '11

Straggler Mitigation:
"Reining in Outliers in MapReduce Clusters using Mantri", OSDI '10
"Effective Straggler Mitigation: Attack of the Clones", NSDI '13
"Why let resources idle? Aggressive Cloning of Jobs with Dolly", HotCloud '12

46 Conclusion
All-or-Nothing property of parallel jobs. PACMan: coordinated cache management with provably optimal cache replacement. Straggler mitigation: a cause-based solution for large jobs, cloning for small jobs.

