Published by Clemence McCormick. Modified over 9 years ago.
1
PREDIcT: Towards Predicting the Runtime of Iterative Analytics
Adrian Popescu (1), Andrey Balmin (2), Vuk Ercegovac (3), Anastasia Ailamaki (1)
[Affiliations 1, 2, 3: names not captured in the transcript]
2
Predicting Runtime of Iterative Analytics
Requirements:
  number of iterations
  per-iteration resources (key features), i.e., for Bulk Synchronous Parallel (BSP): computation, messaging, synch
  cost model
Challenges:
  dependence on prior iterations
  variable resource requirements
[Figure: partitioned input processed by workers over time, Iteration 1 and Iteration 2]
3
PREDIcT at a Glance
Prediction methodology for iterative analytics on graphs: proportionality for resources, similarity for number of iterations
Transformations:
  Input dataset: sampling
  Parameters: transform function
[Figure labels: Sample run (Iterations, Resources); Actual run (Iterations, Resources); Cost model for BSP execution model]
4
Supported Analytics
Similar transformations for algorithms with a global convergence metric
Global convergence metric: e.g., an average, a ratio, a fixed point
  Ranking (e.g., PageRank)
  Graph processing (e.g., neighborhood estimation)
  Graph clustering (e.g., semi-clustering)
5
Example: PageRank
PageRank of a page: given by the rank of its inbound pages
Rank computation: iterative
Convergence: RankChange < τ_G
1. Graph structure (connectivity, degree ratio, diameter): sampling technique
2. Parameters (N, τ_G): transform function
[Figure: example graph G with vertices 1 to 8]
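The convergence rule on this slide can be illustrated with a minimal sketch. The damping factor of 0.85, the handling of dangling vertices, and the use of the average absolute rank change as RankChange are assumptions made for illustration; the slides do not specify them. The function name `pagerank` and its parameters are hypothetical.

```python
def pagerank(out_edges, tau, damping=0.85, max_iters=100):
    """Iterate PageRank until the average per-vertex rank change drops below tau.

    out_edges maps every vertex to a list of its out-neighbors.
    damping and max_iters are illustrative assumptions, not values from the slides.
    """
    n = len(out_edges)
    ranks = {v: 1.0 / n for v in out_edges}
    iteration = 0
    for iteration in range(1, max_iters + 1):
        incoming = {v: 0.0 for v in out_edges}
        dangling = 0.0
        for v, targets in out_edges.items():
            if targets:
                share = ranks[v] / len(targets)
                for t in targets:
                    incoming[t] += share
            else:
                # Vertices without out-edges spread their rank uniformly.
                dangling += ranks[v]
        new_ranks = {
            v: (1 - damping) / n + damping * (incoming[v] + dangling / n)
            for v in out_edges
        }
        # RankChange: average absolute change of rank per vertex.
        change = sum(abs(new_ranks[v] - ranks[v]) for v in out_edges) / n
        ranks = new_ranks
        if change < tau:
            break
    return ranks, iteration
```

The returned iteration count is the quantity PREDIcT tries to predict: it depends both on the graph structure and on the threshold passed as `tau`.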
6
Sampling: Biased Random Jump
Variation of Random Jump (RJ) / random walk
Sampling scale-free graphs, e.g., web graphs
Seed vertices: k high out-degree nodes (hubs)
BRJ: improving connectivity at the same sampling ratio
[Figure: graph G with 16 vertices; the RJ sample is disconnected, the BRJ sample is connected]
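A rough sketch of the idea follows. The slide only states that BRJ is a variation of Random Jump whose seed vertices are k high out-degree hubs. The jump probability, the step budget, and the exact walk rule below are assumptions for illustration, and `biased_random_jump` is a hypothetical name, not the authors' implementation.

```python
import random

def biased_random_jump(out_edges, sampling_ratio, k=2, jump_prob=0.15, seed=0):
    """Sample vertices by a random walk that jumps back to hub seeds.

    out_edges maps every vertex to a list of out-neighbors (all neighbors
    must also be keys). jump_prob and the step budget are assumed values.
    Returns the subgraph induced by the sampled vertices.
    """
    rng = random.Random(seed)
    target = max(1, int(len(out_edges) * sampling_ratio))
    # Seeds: the k vertices with the highest out-degree (hubs).
    hubs = sorted(out_edges, key=lambda v: len(out_edges[v]), reverse=True)[:k]
    sampled = set()
    current = rng.choice(hubs)
    # Bounded walk so the loop ends even if too few vertices are reachable.
    for _ in range(100 * len(out_edges)):
        sampled.add(current)
        if len(sampled) >= target:
            break
        neighbors = out_edges[current]
        if not neighbors or rng.random() < jump_prob:
            # Plain Random Jump would pick any vertex; here the jump is biased to hubs.
            current = rng.choice(hubs)
        else:
            current = rng.choice(neighbors)
    return {v: [t for t in out_edges[v] if t in sampled] for v in sampled}
```

Because every jump lands on a hub, the walk keeps returning to well-connected regions, which is one plausible way to obtain a more connected sample at the same sampling ratio.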
7
Transformations: Preserving Iterations
Sampling ratio (SR) = 50%
Convergence: RankChange(G) < τ_G
Transform function T: τ_S = τ_G / SR
Average rank change: RankChange(S) is proportional to RankChange(G)
S maintains: connectivity, in/out degree ratio, effective diameter
Sample and transform function preserve iterations
[Figure: graph G with vertices 1 to 8 and its sample S with vertices 1, 3, 5, 8]
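As a small worked example of the threshold transform on this slide, the sketch below divides the full-graph threshold by the sampling ratio. The function name `transform_threshold` is hypothetical. One possible intuition, stated as an assumption: if ranks are normalized to sum to 1, a sample with fewer vertices has proportionally larger per-vertex ranks, so its average rank change is larger by about 1/SR.

```python
def transform_threshold(tau_g, sampling_ratio):
    """Scale the convergence threshold of the full graph G for the sample S.

    Implements the rule from the slide: threshold of S = threshold of G / SR.
    """
    if not 0 < sampling_ratio <= 1:
        raise ValueError("sampling_ratio must be in (0, 1]")
    return tau_g / sampling_ratio


# With SR = 50%, the sample run uses a threshold twice as large as the full run.
tau_s = transform_threshold(0.001, 0.5)
```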
8
Prediction
Two extrapolation factors: on edges, on vertices
Customized cost model for the Bulk Synchronous Parallel execution model, i.e., Giraph BSP
[Figure: Sample run → Profiled features → Extrapolator → Scaled features → Cost model F(X_1, …, X_k) → Runtime (estimated actual run)]
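The extrapolation step might look like the sketch below. The slide names two extrapolation factors, one on edges and one on vertices, but does not say which feature uses which factor. The assignment here (message-related features scale with edges, everything else with vertices) and all feature names are assumptions for illustration.

```python
def extrapolate_features(profiled, sample_vertices, sample_edges,
                         full_vertices, full_edges,
                         edge_scaled=("messages", "message_bytes")):
    """Scale features profiled on the sample run up to the full input.

    edge_scaled lists the (hypothetical) feature names scaled by the edge
    factor; all other features are scaled by the vertex factor.
    """
    vertex_factor = full_vertices / sample_vertices
    edge_factor = full_edges / sample_edges
    scaled = {}
    for name, value in profiled.items():
        factor = edge_factor if name in edge_scaled else vertex_factor
        scaled[name] = value * factor
    return scaled


# Example: a sample with 100 vertices and 400 edges from a graph with
# 1000 vertices and 8000 edges.
scaled = extrapolate_features(
    {"active_vertices": 100, "messages": 400},
    sample_vertices=100, sample_edges=400,
    full_vertices=1000, full_edges=8000,
)
```

The scaled features would then be passed to the cost model F(X_1, …, X_k) to estimate the runtime of the actual run.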
9
Cost Model: Translating Features into Time
Bulk Synchronous Parallel model, features per phase:
  computation: active vertices, message counts
  messaging: message counts / sizes, locality of messages
  synch: skew
Each phase but synch: multivariate linear regression
Synchronization: identifying the critical path
[Figure: partitioned input processed by workers over time in Iteration 1]
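To make the two ingredients concrete, here is a minimal sketch: an ordinary least-squares fit for one phase, and a critical-path rule that takes the slowest worker per iteration. The solver (normal equations with Gauss-Jordan elimination), the function names, and the reading of "critical path" as the maximum over workers are assumptions; the slides do not give these details.

```python
def fit_linear(rows, targets):
    """Least-squares fit of targets ~ w0 + w1*x1 + ... + wk*xk."""
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(X[0])
    # Augmented normal equations [X^T X | X^T y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("features are collinear; cannot fit")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def predict(weights, features):
    """Predicted phase time for one feature vector."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))


def iteration_time(worker_phase_times):
    """Critical path: the barrier waits for the slowest worker.

    worker_phase_times holds one list of phase times per worker.
    """
    return max(sum(phases) for phases in worker_phase_times)
```

A usage pattern under these assumptions: fit one model per phase (computation, messaging) from profiled runs, predict the per-worker phase times for each iteration, and sum `iteration_time` over the predicted iterations.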
10
Experimental Evaluation
Setup: 10 machines, 6-core Intel X5660 CPUs, 48 GB RAM, 1 Gbps
Datasets: real graph datasets: Wikipedia (Wiki), Twitter (TW), UK-2002 (UK), LiveJournal (LJ), with sizes in [1, 25] GB
Representative algorithms: PageRank (PR), top-k ranking, and semi-clustering (SC)
Default transformations: BRJ and Tr = (ID Conf, τ_S = τ_G / SR)
Metric: signed relative error, RE = (Predicted - Actual) * 100% / Actual ("+" = over-prediction, "-" = under-prediction)
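The error metric defined on this slide can be written directly as a small helper. The function name `signed_relative_error` is hypothetical; the formula is the one given above.

```python
def signed_relative_error(predicted, actual):
    """RE = (Predicted - Actual) * 100 / Actual, in percent.

    Positive values mean over-prediction, negative values under-prediction.
    """
    return (predicted - actual) * 100.0 / actual


# Predicting 110 iterations when the actual run takes 100 over-predicts;
# predicting 90 under-predicts by the same magnitude.
over = signed_relative_error(110, 100)
under = signed_relative_error(90, 100)
```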
11
Predicting Features (Iterations)
Giraph BSP, 10 machines, real datasets in [1, 25] GB
12
Predicting Features (Iterations)
Predicting iterations for semi-clustering: τ = 0.01 (left) and τ = 0.001 (right).
13
Predicting Features (Iterations)
Predicting key features for top-k ranking: predicting iterations (left) and predicting remote message bytes (right).
14
Predicting Features (Iterations)
PageRank, sampling ratio = 0.1
PREDIcT reduces relative error from [104, 168]% to [0, 11]%
15
Predicting Time
Semi-clustering, neighborhood estimation
Algorithms with variable work per iteration
Cumulative impact of the number of iterations and per-iteration resources
[10, 30]% relative error for a 15% sample
16
Impact
PREDIcT: experimental methodology for estimating key features and runtime for iterative analytics on graphs
Enables key feature prediction (pluggable transformations) and runtime prediction (cost model)
Accurate empirical solution:
  Iterations: [0, 11]% (as opposed to [104, 168]%)
  Time: [10, 30]%
http://dias.epfl.ch/predict
Thank you!
17
Backup slides
18
Cost Model: Model Fitting
Multivariate regression; pool of BSP features
Training data: sample run + historical runs (if such runs exist)
Customizable cost model (per input algorithm): F(X_1, …, X_k)
[Figure: Sample run and Historical runs feed Model Fitting]
19
Cost Model
Customized cost model for the Bulk Synchronous Parallel execution model, features per phase:
  compute: active vertices, message counts
  message: message counts, message sizes, locality of messages
  sync: partitioning scheme / skew
Specialized for network-intensive algorithms
Each phase but sync: multivariate regression
Synchronization modeled implicitly
[Figure: one iteration on workers W1, W2, W3 with compute, message, and sync phases]
20
Feasibility Analysis
Feasible for algorithms dominated by iteration time
21
Context: BSP Processing Model
Giraph BSP: Bulk Synchronous Parallel (BSP)
Vertex-centric model: each vertex performs local processing, then messaging
Algorithms in BSP are inherently iterative
[Figure: one iteration on workers W1 to W4 with compute, message, and sync phases]
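The vertex-centric model described here can be sketched as a toy single-process simulation. This is not Giraph's API: `run_bsp`, the max-value propagation algorithm, and the rule for when a vertex stays active are assumptions chosen only to show the compute, message, and sync phases of each iteration (superstep).

```python
def run_bsp(out_edges, values, max_supersteps=50):
    """Propagate the maximum value through a graph, one superstep at a time.

    out_edges maps every vertex to its out-neighbors; values maps every
    vertex to its initial value. Returns final values and superstep count.
    """
    values = dict(values)
    inbox = {v: [] for v in out_edges}
    active = set(out_edges)
    superstep = 0
    while active and superstep < max_supersteps:
        superstep += 1
        outbox = {v: [] for v in out_edges}
        for v in active:
            # Compute: local processing on the vertex value and received messages.
            best = max([values[v]] + inbox[v])
            changed = best > values[v]
            values[v] = best
            # Message: send in the first superstep, later only after a change.
            if superstep == 1 or changed:
                for t in out_edges[v]:
                    outbox[t].append(best)
        # Sync: barrier; messages become visible only in the next superstep.
        inbox = outbox
        active = {v for v, msgs in inbox.items() if msgs}
    return values, superstep


final_values, supersteps = run_bsp(
    {1: [2], 2: [1, 3], 3: [2]},
    {1: 5, 2: 1, 3: 2},
)
```

The number of supersteps depends on the graph and the initial values, and the number of active vertices and messages changes from one superstep to the next, which illustrates why per-iteration resources vary.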