PREDIcT: Towards Predicting the Runtime of Iterative Analytics
Adrian Popescu, Andrey Balmin, Vuk Ercegovac, Anastasia Ailamaki

Predicting Runtime of Iterative Analytics
Requirements: the number of iterations and the per-iteration resources (key features); for Bulk Synchronous Parallel (BSP) execution, a cost model.
Challenges: dependence on prior iterations; variable resource requirements.
[Figure: BSP execution over partitioned input. In each iteration (Iteration 1, Iteration 2), the workers go through computation, messaging, and synchronization phases over time.]

PREDIcT at a Glance
Prediction methodology for iterative analytics on graphs: proportionality for resources, similarity for the number of iterations.
Transformations: sampling for the input dataset; a transform function for the parameters.
[Figure: a sample run and the actual run, each characterized by its iterations and resources, together with a cost model for the BSP execution model.]

Supported Analytics
Similar transformations apply to algorithms with a global convergence metric (e.g., an average, a ratio, a fixed point):
Ranking (e.g., PageRank)
Graph processing (e.g., neighborhood estimation)
Graph clustering (e.g., semi-clustering)

Example: PageRank
PageRank of a page: given by the rank of its inbound pages.
Rank computation: iterative.
Convergence: RankChange < τ_G.
Factors: 1. graph structure: connectivity, degree ratio, diameter; 2. parameters: N, τ_G.
Ingredients: sampling technique, transform function.
[Figure: example graph G with vertices 1-8.]
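The convergence rule on this slide can be illustrated with a minimal Python sketch. It iterates PageRank until the average per-vertex rank change falls below a threshold `tau`. The damping factor of 0.85, the uniform handling of dangling vertices, and the function name `pagerank` are assumptions made for illustration; the slides do not specify them.

```python
def pagerank(out_edges, damping=0.85, tau=1e-3, max_iters=100):
    """Iterate PageRank until the average rank change drops below tau.

    out_edges maps each vertex to the list of vertices it links to.
    Returns (ranks, iterations_run).
    """
    vertices = set(out_edges)
    for targets in out_edges.values():
        vertices.update(targets)
    n = len(vertices)
    ranks = {v: 1.0 / n for v in vertices}

    for iteration in range(1, max_iters + 1):
        new_ranks = {v: (1.0 - damping) / n for v in vertices}
        for v in vertices:
            targets = out_edges.get(v, [])
            if targets:
                # A page passes its rank on to the pages it links to.
                share = damping * ranks[v] / len(targets)
                for t in targets:
                    new_ranks[t] += share
            else:
                # Dangling vertex: spread its rank uniformly (assumed convention).
                share = damping * ranks[v] / n
                for t in vertices:
                    new_ranks[t] += share
        # Global convergence metric: average rank change across all vertices.
        avg_change = sum(abs(new_ranks[v] - ranks[v]) for v in vertices) / n
        ranks = new_ranks
        if avg_change < tau:
            return ranks, iteration
    return ranks, max_iters
```

The returned iteration count is the quantity the deck aims to predict from a sample run.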

Sampling: Biased Random Jump
Biased Random Jump (BRJ) is a variation of Random Jump (RJ) / random walk sampling, aimed at scale-free graphs, e.g., web graphs.
Seed vertices: k high out-degree nodes (hubs).
BRJ improves connectivity at the same sampling ratio.
[Figure: graph G with vertices 1-16; the RJ sample is disconnected, the BRJ sample is connected.]
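A rough Python sketch of one way to read Biased Random Jump follows. It performs a random walk that, with some probability, jumps back to one of the k highest out-degree vertices rather than to an arbitrary vertex. The jump probability, the stopping rule, the restart on dead ends, and every name in the code are assumptions for illustration, not details taken from the slides.

```python
import random


def biased_random_jump(out_edges, sampling_ratio, k=2, jump_prob=0.15,
                       seed=0, max_steps=None):
    """Sample an induced subgraph by a random walk that jumps to hub vertices.

    out_edges maps each vertex to the list of vertices it links to.
    Returns the sampled subgraph in the same adjacency format.
    """
    rng = random.Random(seed)
    vertices = set(out_edges)
    for targets in out_edges.values():
        vertices.update(targets)
    target_size = max(1, int(len(vertices) * sampling_ratio))

    # Seed vertices: the k vertices with the highest out-degree (hubs).
    hubs = sorted(vertices, key=lambda v: (-len(out_edges.get(v, [])), v))[:k]
    if max_steps is None:
        max_steps = 1000 * len(vertices)  # safety bound (assumed)

    sampled = set()
    current = rng.choice(hubs)
    for _ in range(max_steps):
        sampled.add(current)
        if len(sampled) >= target_size:
            break
        neighbors = out_edges.get(current, [])
        if not neighbors or rng.random() < jump_prob:
            # Biased jump: restart from a hub instead of a uniform random vertex.
            current = rng.choice(hubs)
        else:
            current = rng.choice(neighbors)

    # Keep only the edges whose endpoints were both sampled.
    return {v: [t for t in out_edges.get(v, []) if t in sampled]
            for v in sampled}
```

Restarting from hubs keeps the walk in well-connected regions, which is one plausible way to obtain the improved connectivity the slide attributes to BRJ.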

Transformations: Preserving Iterations
Sampling: sample S of graph G at Sampling Ratio (SR) = 50%. S maintains connectivity, in/out degree ratio, and effective diameter.
Convergence on G: RankChange(G) < τ_G.
Average rank change: RankChange(S) is proportional to RankChange(G).
Transform function T: τ_S = τ_G / SR.
The sample and the transform function together preserve the number of iterations.
[Figure: graph G with vertices 1-8 and its sample S with vertices 1, 3, 5, 8.]
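As a small worked example of the transform function, the Python sketch below scales the convergence threshold of the full graph (written `tau_g`) by the inverse of the sampling ratio to obtain the threshold for the sample run (`tau_s`). The function name and the input validation are illustrative assumptions.

```python
def transform_threshold(tau_g, sampling_ratio):
    """Return the sample-run threshold tau_S = tau_G / SR."""
    if not 0.0 < sampling_ratio <= 1.0:
        raise ValueError("sampling_ratio must be in (0, 1]")
    return tau_g / sampling_ratio


# With SR = 50%, a threshold of 0.001 on G becomes 0.002 on S.
tau_s = transform_threshold(0.001, 0.5)
```

The sample run then uses the looser threshold `tau_s`, so that it stops after roughly as many iterations as the run on G would with `tau_g`.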

Prediction
A sample run produces profiled features. The extrapolator scales them, using two extrapolation factors (on edges and on vertices). The cost model F(X_1, …, X_k) maps the scaled features to the estimated runtime of the actual run.
The cost model is customized for the Bulk Synchronous Parallel execution model, i.e., Giraph BSP.
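The extrapolation step might look like the following Python sketch, which scales features profiled on the sample run by either a vertex factor or an edge factor. Which feature uses which factor is not stated on the slide; the `edge_scaled` set, the feature names, and the function name are hypothetical.

```python
def extrapolate_features(profiled, sample_graph, full_graph, edge_scaled):
    """Scale features profiled on the sample run up to the full graph.

    profiled:     dict of feature name -> value measured on the sample run
    sample_graph: dict with "vertices" and "edges" counts of the sample
    full_graph:   dict with "vertices" and "edges" counts of the full input
    edge_scaled:  names of features assumed to grow with the edge count;
                  all other features are scaled by the vertex factor
    """
    vertex_factor = full_graph["vertices"] / sample_graph["vertices"]
    edge_factor = full_graph["edges"] / sample_graph["edges"]
    return {
        name: value * (edge_factor if name in edge_scaled else vertex_factor)
        for name, value in profiled.items()
    }


scaled = extrapolate_features(
    {"active_vertices": 100, "message_count": 400},
    sample_graph={"vertices": 100, "edges": 400},
    full_graph={"vertices": 1000, "edges": 6000},
    edge_scaled={"message_count"},
)
```

In this example the vertex factor is 10 and the edge factor is 15, so the two features are scaled differently even though they come from the same sample run.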

Cost Model: Translating Features into Time
Bulk Synchronous Parallel model: in each iteration, the workers go through computation, messaging, and synchronization phases over the partitioned input.
Computation: active vertices, message counts.
Messaging: message counts / sizes, locality of messages.
Synchronization: skew.
Each phase but synchronization: multivariate linear regression.
Synchronization: identifying the critical path.
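To make the critical-path idea concrete, here is a minimal Python sketch. It assumes that linear models for the computation and messaging phases have already been fitted, and it estimates the time of one iteration as the time of the slowest worker, since every worker waits at the synchronization barrier. The coefficient values, feature names, and function names are made up for illustration.

```python
def phase_time(model, features):
    """Linear model: intercept plus a weighted sum of the phase's features."""
    return model["intercept"] + sum(
        weight * features[name] for name, weight in model["weights"].items()
    )


def iteration_time(workers, compute_model, message_model):
    """Estimate one BSP iteration as the time of its slowest worker."""
    per_worker = [
        phase_time(compute_model, w) + phase_time(message_model, w)
        for w in workers
    ]
    # The barrier makes the slowest worker the critical path.
    return max(per_worker)


# Hypothetical coefficients and per-worker features.
compute_model = {"intercept": 1.0,
                 "weights": {"active_vertices": 0.5, "message_count": 0.25}}
message_model = {"intercept": 2.0,
                 "weights": {"message_count": 0.5, "message_bytes": 0.125}}
workers = [
    {"active_vertices": 4, "message_count": 8, "message_bytes": 16},
    {"active_vertices": 10, "message_count": 8, "message_bytes": 16},
]
estimate = iteration_time(workers, compute_model, message_model)
```

A skewed partitioning shows up here as one worker with much larger feature values, which raises the maximum and therefore the estimated iteration time.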

Experimental Evaluation
Setup: 10 machines, 6-core Intel X5660 CPUs, 48 GB RAM, 1 Gbps.
Datasets: real graph datasets: Wikipedia (Wiki), Twitter (TW), UK-2002 (UK), LiveJournal (LJ), with sizes in [1, 25] GB.
Representative algorithms: PageRank (PR), top-k ranking, and semi-clustering (SC).
Default transformations: BRJ and Tr = (ID_Conf, τ_S = τ_G / SR).
Metric: signed relative error, RE = (Predicted - Actual) * 100% / Actual ("+" = over-prediction, "-" = under-prediction).
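The signed relative error metric can be written as a short Python helper. The function name is an illustrative choice; the formula is the one given on the slide.

```python
def signed_relative_error(predicted, actual):
    """RE in percent: positive means over-prediction, negative under-prediction."""
    return (predicted - actual) * 100.0 / actual


over = signed_relative_error(110.0, 100.0)   # 10.0, over-prediction
under = signed_relative_error(90.0, 100.0)   # -10.0, under-prediction
```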

Predicting Features (Iterations)
Giraph BSP, 10 machines, real datasets in [1, 25] GB.

Predicting Features (Iterations)
Predicting iterations for semi-clustering: τ = 0.01 (left) and τ = 0.001 (right).

Predicting Features (Iterations)
Predicting key features for top-k ranking: iterations (left) and remote message bytes (right).

Predicting Features (Iterations)
PageRank, Sampling Ratio = 0.1: PREDIcT reduces the relative error from [104, 168]% to [0, 11]%.

Predicting Time
Algorithms with variable work per iteration: semi-clustering, neighborhood estimation.
[10, 30]% relative error for a 15% sample.
Cumulated impact of the number of iterations and the per-iteration resources.

Impact
PREDIcT: experimental methodology for estimating key features and runtime for iterative analytics on graphs.
Enables key feature prediction (pluggable transformations) and runtime prediction (cost model).
Accurate empirical solution: iterations within [0, 11]% relative error (as opposed to [104, 168]%); time within [10, 30]%.
http://dias.epfl.ch/predict
Thank you!

Backup slides

Cost Model: Model Fitting
Training data: the sample run plus historical runs (if such runs exist).
Model fitting: multivariate regression over a pool of BSP features.
Result: a customizable cost model F(X_1, …, X_k), fitted per input algorithm.
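The model-fitting step can be sketched in plain Python as ordinary least squares with an intercept, solved through the normal equations. This is one standard way to fit a multivariate linear regression and is not necessarily the procedure used by the authors; the training rows below are synthetic and the function names are hypothetical.

```python
def fit_linear_model(rows, targets):
    """Least-squares fit of y = c0 + c1*x1 + ... + ck*xk; returns [c0, ..., ck]."""
    X = [[1.0] + [float(x) for x in row] for row in rows]
    k = len(X[0])
    # Augmented normal equations: (X^T X | X^T y).
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * y for r, y in zip(X, targets))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("features are collinear; cannot fit")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def predict(coeffs, row):
    """Apply a fitted model to one feature vector."""
    return coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], row))


# Synthetic training data generated from time = 2 + 3*x1 + 0.5*x2.
rows = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8)]
times = [2 + 3 * x1 + 0.5 * x2 for x1, x2 in rows]
coeffs = fit_linear_model(rows, times)
```

In practice each row would hold the BSP features of one profiled iteration, taken from the sample run and from any historical runs, and the target would be the measured time of the corresponding phase.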

Cost Model
Customized cost model for the Bulk Synchronous Parallel execution model, specialized for network-intensive algorithms.
Compute phase: active vertices, message counts.
Message phase: message counts, message sizes, locality of messages.
Sync phase: partitioning scheme / skew.
Each phase but sync: multivariate regression.
Synchronization is modeled implicitly.
[Figure: one iteration across workers W1, W2, W3, with compute, message, and sync phases.]

Feasibility Analysis
Feasible for algorithms dominated by iteration time.

Context: BSP Processing Model
Giraph BSP: Bulk Synchronous Parallel (BSP) execution.
Vertex-centric model: each vertex performs local processing, then messaging.
Algorithms in BSP are inherently iterative.
[Figure: workers W1-W4; each iteration consists of compute, message, and sync phases.]
