OPTiC: Opportunistic Graph Processing in Multi-Tenant Clusters
Muntasir Raihan Rahman, Nokia Bell Labs; Indranil Gupta, University of Illinois Urbana-Champaign; Akash Kapoor, Princeton University; Haozhen Ding, Airbnb
Distributed Protocols Research Group (DPRG)
OPTiC: Opportunistic Graph Processing in Multi-Tenant Clusters
- OPTiC is the first multi-tenant system for graph processing.
- OPTiC bridges the gap between the graph processing layer and the cluster scheduler layer.
- Key techniques:
  - A new algorithm for estimating the progress of a graph computation
  - Smart prefetching of resources
- We implemented our system on top of the Apache Giraph + YARN stack.
- We obtain a 20-82% improvement in job completion time for realistic workloads under realistic network conditions.
Graphs are Ubiquitous
- Biological: food web, protein interaction network, metabolic network
- Man-made: online social network (OSN), web graph, the Internet

Graphs are also massive. Scale: the Facebook graph has |V| = 1.1B and |E| = 150B (May 2013).
Distributed Graph Processing
Popular computational frameworks for large-scale graph processing: Google Pregel, Apache Giraph, Dato PowerGraph, PowerLyra, Databricks GraphX.
Anatomy of a Graph Processing Job
- Graph preprocessing: (1) load the graph from disk, (2) partition it. Preprocessing time is included in total job turnaround time and can be significant.
- Graph computation (Gather-Apply-Scatter): workers synchronize at a barrier between iterations.
- Termination: write results to disk, then teardown.
Graph Processing on Multi-tenant Clusters
- Graph processing engines (Apache Giraph, GraphX) do not take advantage of multi-tenancy in the cluster scheduler.
- Cluster schedulers (Apache YARN, Mesos) are unaware of the graph nature of jobs and assume only map-reduce or similar abstractions.
- This leaves a gap between the two layers.
- Benefits of multi-tenancy: consolidation of multiple workloads and reduction in capex and opex costs, so it is a natural deployment plan for graph processing.
OPTiC: Opportunistic Graph Processing in Multi-Tenant Clusters
Key idea: opportunistically overlap (1) the graph preprocessing phase of waiting jobs with (2) the graph computation phase of current jobs.

System assumptions:
- Synchronous graph processing (workers sync periodically)
- Over-subscribed cluster (there is always a waiting job)
- No pre-emption
- All input graphs stored in a distributed file system (e.g., HDFS)
- Disk locality matters
Key Idea, Simplified: Opportunistic Overlapping
Start the preprocessing phase of the next waiting job on the cluster resources running the maximum progress job (MPJ). For example, if Job 1 is 70% complete, Job 2 is 90% complete, and Job 3 is 50% complete, the cluster scheduler prefetches onto the resources of Job 2.

Benefits:
- The MPJ is the job most likely to free up cluster resources first.
- When the next waiting job is scheduled, its preprocessing phase is already underway.

This raises two challenges: (1) prefetching resources and (2) estimating progress. Estimating progress is not trivial; we cover prefetching first and return to progress metrics afterwards.
Challenge #1: How to Prefetch

Desired feature: minimal interference with currently running jobs.

Progress-Aware Memory Prefetching:
- Prefetch the graph of the waiting job directly into the memory of the MPJ server(s).
- But that memory is being used to store and compute on the MPJ's graph, so this interferes with the MPJ and can increase its run time.

Progress-Aware Disk Prefetching (PADP):
- Prefetch the graph of the waiting job into the disk of the MPJ server(s); no interference with MPJ memory.
- When the MPJ is done, the waiting job loads its graph from local disk instead of remote disk. The local disk fetch avoids network contention: on a 20-server Amazon EC2 virtual cluster we measured a mean network bandwidth of 73.2 MB/s, below the mean local disk bandwidth, so it is cheaper to fetch from local disk than over the network.

(MPJ = Maximum Progress Job)
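A minimal sketch of the disk-prefetch step, assuming the input graph lives in HDFS. Stock HDFS only lets a client raise a file's replication factor and chooses replica placement itself; steering the extra replica onto the MPJ server's disks is the job of OPTiC's replica placement engine (a custom block placement policy). The class and method names below are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: request one extra on-disk replica of the waiting
// job's input graph (RF=3 -> RF=4, at most one opportunistic replica).
public class PadpPrefetcher {
    public static void addOpportunisticReplica(String graphPath) throws Exception {
        FileSystem dfs = FileSystem.get(new Configuration());
        Path input = new Path(graphPath);
        short rf = dfs.getFileStatus(input).getReplication();
        // Placement of the new replica onto the MPJ server(s) is delegated
        // to a custom block placement policy (the replica placement engine);
        // stock HDFS decides placement on its own.
        dfs.setReplication(input, (short) (rf + 1));
    }
}
```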
Architecture: OPTiC with PADP
We describe the architecture of OPTiC and how it interacts with a cluster scheduler. The components are: the graph processing engine, the progress estimation engine, the OPTiC scheduler with its central job queue, the cluster scheduler, and the distributed file system with its replica placement engine.
OPTiC-PADP Scheduling Algorithm
OPTiC scheduler: for the next waiting job in the queue,
1. Fetch progress information of the running jobs.
2. Determine the server(s) S running the MPJ (Maximum Progress Job).
3. Tell the DFS to create an additional replica of the next waiting job's graph on the disks of S.

Each running job periodically sends progress information to OPTiC. When the MPJ finishes, the cluster scheduler schedules the next waiting job on S, and the job fetches its graph from local disk instead of a remote disk in the DFS. A minimal sketch of this loop appears below.

Creating additional replicas on disk has a non-zero storage cost, but disks are mostly under-utilized, so the actual dollar cost of the system is close to zero.

A critical decision is how to compare running jobs in terms of progress: for any two jobs, we need to infer which has made more progress. This leads to our progress comparison algorithm; we next go over various methods of progress estimation and how we arrived at our final coarse-grained approach.
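A minimal sketch of this scheduling loop. Job, ReplicaPlacementEngine, and the scalar progress score (standing in for the phase/interval comparator described later) are hypothetical simplifications.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the OPTiC-PADP scheduler loop.
public class OpticScheduler {
    interface ReplicaPlacementEngine {
        // Ask the DFS to place one extra replica of graphPath on `servers`.
        void addReplica(String graphPath, List<String> servers);
    }
    static final class Job {
        final String graphPath;
        final List<String> servers; // servers currently running this job
        Job(String graphPath, List<String> servers) {
            this.graphPath = graphPath;
            this.servers = servers;
        }
    }

    private final Deque<Job> waiting = new ArrayDeque<>();
    private final Map<Job, Double> progress = new HashMap<>(); // running job -> score
    private final ReplicaPlacementEngine placement;

    OpticScheduler(ReplicaPlacementEngine placement) { this.placement = placement; }

    void enqueue(Job job) { waiting.add(job); }

    // Running jobs periodically push progress reports here.
    void onProgressReport(Job running, double score) {
        progress.put(running, score);
        prefetchForHeadOfQueue();
    }

    // Point the head-of-queue job's opportunistic replica at the MPJ's disks.
    private void prefetchForHeadOfQueue() {
        Job next = waiting.peek();
        if (next == null || progress.isEmpty()) return;
        Job mpj = Collections.max(progress.entrySet(),
                                  Map.Entry.comparingByValue()).getKey();
        placement.addReplica(next.graphPath, mpj.servers);
    }
}
```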
Challenge #2: Estimating Progress of the Graph Computation

- Profiling: profile the run time of various graph algorithms on different cluster configurations for different graph sizes. Huge overhead, and dependent on job details (-).
- Using the cluster scheduler's progress estimator: for example, Giraph programs are mapped to map-reduce programs, so the cluster's map-reduce progress estimator can be used to estimate graph computation progress. Cluster dependent (-).

Profiling has high overhead and needs to be repeated for every new algorithm; cluster scheduler metrics depend on the cluster's job abstraction. This leads to our next idea: graph-level metrics, which are profile-free and cluster-agnostic.
Profile-free, Cluster-agnostic Progress Estimation
Use graph processing layer metrics: track the evolution of the active vertex count (AVC).
- A vertex is active as long as it has incoming messages from the previous iteration.
- At termination, AVC = 0.
- Profile-independent and cluster-agnostic (+).

A sketch of how AVC can be tracked is shown below.
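A minimal sketch of how AVC could be tracked in Giraph with a sum aggregator; the aggregator name and the reporting hook are assumptions, not Giraph built-ins. Each vertex is assumed to add 1 to the aggregator in any superstep in which it remains active, and the master reads the total.

```java
import org.apache.giraph.aggregators.LongSumAggregator;
import org.apache.giraph.master.DefaultMasterCompute;
import org.apache.hadoop.io.LongWritable;

// Sketch: the master reads an AVC aggregator each superstep and reports
// AVCP = AVC / N. Vertices are assumed to call
// aggregate(AVC_AGG, new LongWritable(1)) whenever they remain active.
public class AvcMasterCompute extends DefaultMasterCompute {
    public static final String AVC_AGG = "optic.avc"; // hypothetical name

    @Override
    public void initialize() throws InstantiationException, IllegalAccessException {
        registerAggregator(AVC_AGG, LongSumAggregator.class);
    }

    @Override
    public void compute() {
        long avc = this.<LongWritable>getAggregatedValue(AVC_AGG).get();
        double avcp = getTotalNumVertices() == 0
                ? 0.0 : (double) avc / getTotalNumVertices();
        // Hypothetical hook: push avcp to OPTiC's progress estimation engine.
        System.out.println("superstep " + getSuperstep() + " AVCP = " + avcp);
    }
}
```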
Evolution of AVCP = AVC/N

We benchmarked the evolution of AVCP (the active vertex count normalized by the total vertex count N) for many well-known graph algorithms run atop graph processing engines and found common patterns: SSSP is increasing then decreasing, PageRank is flat then decreasing, K-core decomposition is decreasing, and connected components is flat then decreasing.

- Initial non-decreasing phase: AVCP at, or going towards, 1.
- Decreasing phase: AVCP going towards 0.

We generalize these trends (flat-decreasing, increasing-decreasing, null-decreasing) to a non-decreasing phase followed by a decreasing phase, and approximate the curve with two piecewise-linear segments, assuming a constant rate of change within each segment. Techniques that model a variable rate of change of AVCP (acceleration) are possible, but we leave them as future work; we found good performance with the current approximation, so we saw no need to complicate the scheme.

Progress measure: how far AVCP is from its final value of 0%. We compare jobs by first comparing their phases; if they are in the same phase, we compare progress within that phase. A sketch of the phase classification follows.
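A minimal sketch of the phase classification under this two-segment model; the sampling convention (one AVCP value per superstep) is an assumption.

```java
// Sketch: classify a job's phase under the two-segment piecewise-linear
// model of AVCP (an optional non-decreasing phase, then a decreasing one).
public class PhaseClassifier {
    public enum Phase { NON_DECREASING, DECREASING }

    // samples: AVCP values in [0,1], oldest first, one per superstep.
    public static Phase classify(double[] samples) {
        // Once AVCP has dropped below its running maximum, the job has
        // entered the decreasing phase and never leaves it.
        double max = 0.0;
        for (double s : samples) max = Math.max(max, s);
        int n = samples.length;
        return (n > 0 && samples[n - 1] < max) ? Phase.DECREASING
                                               : Phase.NON_DECREASING;
    }
}
```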
Progress Comparator Algorithm
MPJ = Maximum Progress Job. Each job's AVCP trajectory rises from 0% towards 100% of the non-decreasing phase, then falls from 100% back to 0% in the decreasing phase.

CASE 1: the jobs are in different phases. Job 1 is 70% of the way through the non-decreasing phase, while Job 2 is 50% of the way through the decreasing phase. Job 2, being in the second (decreasing) phase, is the MPJ.
Progress Comparator Algorithm (2)
MPJ = Maximum Progress Job.

CASE 2: both jobs are in the non-decreasing phase, e.g., Job 1 at 70% and Job 2 at 20% within the phase. Instead of a fine-grained comparison within phases, we divide the phase into three intervals, Low, Medium, and High (L/M/H), and check which job is in a later interval. Raw percentages are too fine-grained: with intervals, two jobs at 60% and 70% within the same phase are considered equivalent, but two jobs at 20% and 70% are not. Here Job 1 is closer to 100% in the first phase, so it is the MPJ. The intervals introduce some randomness for jobs with AVCP close to each other (e.g., if Job 2 were at 60% (M) instead).

One might ask whether a job at the end of phase 1 (90%) and a job at the start of phase 2 (10%) should be considered equivalent; if so, the distinct phases could be dropped in favor of a single phase sub-divided into sub-regions.

CASE 3: both jobs in the decreasing phase (handled similarly). A sketch of the comparator follows.
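A minimal sketch of the comparator over (phase, interval) pairs. PhaseProgress is a hypothetical container, and the equal-thirds interval boundaries are an assumption; the slides only specify three intervals L/M/H.

```java
import java.util.Comparator;

// Sketch of the coarse-grained progress comparator: a job in the
// decreasing phase beats one in the non-decreasing phase (CASE 1);
// within the same phase, only the L/M/H interval matters (CASES 2-3).
public class ProgressComparator implements Comparator<ProgressComparator.PhaseProgress> {

    public static final class PhaseProgress { // hypothetical container
        final boolean decreasing;  // true if in the decreasing phase
        final double withinPhase;  // progress within the phase, in [0,1]
        public PhaseProgress(boolean decreasing, double withinPhase) {
            this.decreasing = decreasing;
            this.withinPhase = withinPhase;
        }
    }

    private static int interval(double p) { // L=0, M=1, H=2
        return p < 1.0 / 3 ? 0 : (p < 2.0 / 3 ? 1 : 2);
    }

    @Override
    public int compare(PhaseProgress a, PhaseProgress b) {
        // CASE 1: different phases -- the decreasing-phase job has more progress.
        if (a.decreasing != b.decreasing) {
            return Boolean.compare(a.decreasing, b.decreasing);
        }
        // CASES 2 and 3: same phase -- compare coarse intervals only, so
        // jobs whose AVCP falls in the same interval deliberately tie.
        return Integer.compare(interval(a.withinPhase), interval(b.withinPhase));
    }
}
```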
Evaluation Setup

- Testbed: 9 quad-core servers with 64 GB memory and 200 GB disks, running Ubuntu 14.04.
- Test algorithms: single-source shortest path (SSSP), K-core decomposition (KC), PageRank (PR).
- Graphs: uniformly randomly generated synthetic graphs.
- Performance metric: job completion time.
- Compared scheduling algorithms:
  - Baseline (B): default YARN FIFO policy (RF=3).
  - PADP (P): OPTiC PADP policy (RF=3, plus at most one opportunistically created replica).
Facebook Production Trace Workload
- Job size distribution taken from a Facebook trace (vertex count proportional to map count); most jobs in the cluster are small.
- Poisson arrival process with mean 7 s; network delay log-normally distributed, LN(3 ms).
- Median turnaround time (TAT, in seconds) improves by 73%; 95th percentile TAT improves by 54%.
Yahoo! Production Trace Workload
- Map-reduce job trace from a Yahoo! production cluster of several hundred servers.
- The trace has 300 jobs with job sizes and job arrival times; the arrival process is bursty.
- Heterogeneous jobs: a mixture of SSSP, KC, and PR.
- Median TAT improves by 78%; 95th percentile TAT improves by 70%.
Scale and Graph Commonality Experiment
(Figure: alternating Baseline (B) and PADP (P) bars.) Graph commonality (the degree of graph sharing among jobs) increases from left to right; average graph size also increases from left to right.
Related Work

Cluster schedulers (map-reduce abstraction, multi-tenant):
- YARN with the Fair Scheduler; Mesos with Dominant Resource Fairness.
- These provide multi-tenancy with fairness for sharing cluster resources; the OPTiC scheduler is additionally aware of graph computation progress.

Graph processing (single-tenant):
- Pregel: the first message-passing system based on BSP.
- GraphLab: proposes shared-memory computation.
- PowerGraph: optimizes for power-law graphs.
- LFGraph: improves performance with cheap partitioning and publish-subscribe message flow.
- OPTiC improves performance for multi-tenant graph processing.

Progress estimation:
- Many systems estimate the progress of map-reduce jobs, e.g., KAMD.
- SQL progress estimators, e.g., DNE (Driver Node Estimator) and TGN (Total Get Next).
- The OPTiC progress estimator is based on graph-processing-level metrics.
Summary of OPTiC

- OPTiC is the first multi-tenant graph processing system.
- Key techniques:
  - Prefetching: we overlap the graph preprocessing phase of waiting jobs with the computation phase of running jobs.
  - Progress estimation: we propose a new algorithm for estimating the progress of graph processing jobs using a graph-level metric that is independent of the underlying cluster and job details.
- We obtain a 20-82% improvement in job completion time for realistic workloads under realistic network conditions.
- Cost: increased replication of the input graph in the DFS (from 3 to 3 plus at most one opportunistically created replica).