PREDIcT: Towards Predicting the Runtime of Iterative Analytics Adrian Popescu 1, Andrey Balmin 2, Vuk Ercegovac 3, Anastasia Ailamaki 1 1 2 3.


1 PREDIcT: Towards Predicting the Runtime of Iterative Analytics Adrian Popescu 1, Andrey Balmin 2, Vuk Ercegovac 3, Anastasia Ailamaki 1 1 2 3

2 Predicting Runtime of Iterative Analytics. Requirements: the number of iterations and the per-iteration resources (key features), plus a cost model, i.e., for Bulk Synchronous Parallel (BSP). Challenges: dependence on prior iterations and variable resource requirements. [Figure: a partitioned input processed by workers over time; each iteration (Iteration 1, Iteration 2) consists of computation, messaging, and synch phases.]

3 PREDIcT at a Glance. Prediction methodology for iterative analytics on graphs: proportionality for resources, similarity for the number of iterations. Transformations: sampling for the input dataset, a transform function for the parameters. [Figure: a sample run yields iterations and resources; these map to the iterations and resources of the actual run, and a cost model for the BSP execution model is applied.]

4 Supported Analytics. Similar transformations apply to algorithms with a global convergence metric, e.g., an average, a ratio, or a fixed point. Examples: ranking (e.g., PageRank), graph processing (e.g., neighborhood estimation), graph clustering (e.g., semi-clustering).

5 Example: PageRank. The PageRank of a page is given by the rank of its inbound pages. Rank computation is iterative. Convergence: RankChange < τ_G. 1. Graph structure: connectivity, degree ratio, diameter (sampling technique). 2. Parameters: N, τ_G (transform function). [Figure: example graph G with vertices 1 to 8.]
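
As a rough illustration of the convergence rule on this slide, the sketch below iterates PageRank until the average rank change falls below a threshold (written `tau` here). The damping factor of 0.85, the handling of vertices without out-links, and the function name `pagerank` are assumptions made for the example; the slides do not specify them.

```python
def pagerank(out_edges, tau, damping=0.85, max_iters=1000):
    """Iterate PageRank until the average rank change drops below tau.

    out_edges maps every vertex to the list of vertices it links to
    (every link target must also be a key). Returns (ranks, iterations).
    The damping value is an assumption, not taken from the slides.
    """
    vertices = list(out_edges)
    n = len(vertices)
    ranks = {v: 1.0 / n for v in vertices}
    for iteration in range(1, max_iters + 1):
        # Every vertex starts with the teleport share of the rank mass.
        new_ranks = {v: (1.0 - damping) / n for v in vertices}
        dangling = 0.0
        for v in vertices:
            targets = out_edges[v]
            if targets:
                share = damping * ranks[v] / len(targets)
                for t in targets:
                    new_ranks[t] += share
            else:
                # Rank of a vertex without out-links is spread evenly.
                dangling += damping * ranks[v]
        for v in vertices:
            new_ranks[v] += dangling / n
        # Global convergence metric: average absolute rank change.
        rank_change = sum(abs(new_ranks[v] - ranks[v]) for v in vertices) / n
        ranks = new_ranks
        if rank_change < tau:
            return ranks, iteration
    return ranks, max_iters
```

The iteration count returned here is the quantity that depends on both the graph structure and the threshold, which is why both have to be handled when predicting from a sample.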

6 Sampling: Biased Random Jump (BRJ). A variation of Random Jump (RJ) / random walk sampling, aimed at scale-free graphs, e.g., web graphs. Seed vertices: k high out-degree nodes (hubs). BRJ improves connectivity at the same sampling ratio. [Figure: example graph G; the RJ sample is disconnected, the BRJ sample is connected.]
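
The following is a minimal sketch of a biased random jump sampler, not the authors' implementation. It assumes that the walk follows a random out-edge at each step and, with probability `jump_prob` or at a dead end, restarts from one of the `k` vertices with the highest out-degree. The parameter names, the default jump probability, and the step limit are all hypothetical.

```python
import random


def biased_random_jump(out_edges, sampling_ratio, k=2, jump_prob=0.15, seed=0):
    """Sample a subgraph by a random walk that restarts at hub vertices.

    out_edges maps every vertex to its list of out-neighbors. Returns the
    induced subgraph on the sampled vertices, in the same format. The
    sample can be smaller than requested if too few vertices are
    reachable from the hubs within the step limit.
    """
    rng = random.Random(seed)
    target = max(1, round(sampling_ratio * len(out_edges)))
    # Seed vertices: the k vertices with the highest out-degree (hubs).
    hubs = sorted(out_edges, key=lambda v: len(out_edges[v]), reverse=True)[:k]
    current = rng.choice(hubs)
    sampled = {current}
    steps, max_steps = 0, 100 * len(out_edges)
    while len(sampled) < target and steps < max_steps:
        steps += 1
        neighbors = out_edges[current]
        if not neighbors or rng.random() < jump_prob:
            # Biased jump: restart from a hub, not a uniform random vertex.
            current = rng.choice(hubs)
        else:
            current = rng.choice(neighbors)
        sampled.add(current)
    # Keep only the edges whose endpoints were both sampled.
    return {v: [t for t in out_edges[v] if t in sampled] for v in sampled}
```

Restarting from hubs keeps the walk inside the well-connected core of the graph, which is one plausible reading of why the slide reports better connectivity at the same sampling ratio.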

7 Transformations: Preserving Iterations. Sampling ratio (SR) = 50%. Convergence on G: RankChange(G) < τ_G. Transform function T: τ_S = τ_G / SR. Average rank change: RankChange(S) is proportional to RankChange(G). The sample S maintains connectivity, in/out degree ratio, and effective diameter. The sample and the transform function together preserve the number of iterations. [Figure: graph G with vertices 1 to 8 and sample S with vertices 1, 3, 5, 8.]
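
The threshold transform on this slide is a one-line scaling. The sketch below writes it out, with a hypothetical function name and an input check that is not part of the slides.

```python
def transform_threshold(tau_g, sampling_ratio):
    """Scale the convergence threshold of the full graph for a sample run.

    Implements the transform from the slide: threshold on the sample =
    threshold on the full graph / sampling ratio.
    """
    if not 0.0 < sampling_ratio <= 1.0:
        raise ValueError("sampling_ratio must be in (0, 1]")
    return tau_g / sampling_ratio
```

With a sampling ratio of 50%, as in the slide's example, the sample run uses a threshold twice as large as the one configured for the full graph.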

8 Prediction. The sample run produces profiled features; an extrapolator turns them into scaled features; the cost model F(X_1,…,X_k) maps the scaled features to the runtime of the estimated actual run. Two extrapolation factors: on edges and on vertices. Customized cost model for the Bulk Synchronous Parallel execution model, i.e., Giraph BSP.
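
A minimal sketch of the extrapolation step, assuming that each factor is the ratio between the full graph and the sample (vertex count for one, edge count for the other) and that the caller decides which features scale with edges. The feature names and the function signature are hypothetical.

```python
def extrapolate_features(profiled, sample_size, full_size, edge_features):
    """Scale features profiled on the sample run up to the full dataset.

    profiled:      feature name -> value measured on the sample run
    sample_size:   {"vertices": ..., "edges": ...} for the sample
    full_size:     {"vertices": ..., "edges": ...} for the full graph
    edge_features: names scaled by the edge factor; all other features
                   are scaled by the vertex factor (an assumption).
    """
    vertex_factor = full_size["vertices"] / sample_size["vertices"]
    edge_factor = full_size["edges"] / sample_size["edges"]
    return {
        name: value * (edge_factor if name in edge_features else vertex_factor)
        for name, value in profiled.items()
    }
```

For instance, a feature such as the number of active vertices would naturally follow the vertex factor, while message counts would follow the edge factor; the scaled values are then the inputs to the cost model.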

9 Cost Model: Translating Features into Time. Bulk Synchronous Parallel model, per iteration over a partitioned input: computation (features: active vertices, message counts), messaging (features: message counts / sizes, locality of messages), and synch (skew). Each phase except synch: multivariate linear regression. Synchronization: identifying the critical path.
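
One way to read "identifying the critical path" is that an iteration lasts as long as its slowest worker, since all workers wait at the barrier. The sketch below follows that reading; it is an interpretation, not the authors' model. The linear coefficients are taken as given, and all names and numbers are hypothetical.

```python
def phase_time(features, coefficients, intercept=0.0):
    """Linear model for one phase: intercept + sum(coefficient * feature)."""
    return intercept + sum(
        weight * features[name] for name, weight in coefficients.items()
    )


def iteration_time(worker_features, compute_model, messaging_model):
    """Time of one BSP iteration under the slowest-worker assumption.

    worker_features holds one feature dict per worker. Synchronization is
    not modeled separately: every worker waits for the slowest one, so the
    iteration takes as long as the maximum per-worker time.
    """
    per_worker = [
        phase_time(f, compute_model) + phase_time(f, messaging_model)
        for f in worker_features
    ]
    return max(per_worker)
```

The total runtime estimate is then the sum of `iteration_time` over the predicted number of iterations.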

10 Experimental Evaluation. Setup: 10 machines, 6-core Intel X5660 CPUs, 48 GB RAM, 1 Gbps. Datasets: real graph datasets Wikipedia (Wiki), Twitter (TW), UK-2002 (UK), and LiveJournal (LJ), with sizes in [1, 25] GB. Representative algorithms: PageRank (PR), top-k ranking, and semi-clustering (SC). Default transformations: BRJ and Tr = (ID_Conf, τ_S = τ_G / SR). Metric: signed relative error, RE = (Predicted - Actual) * 100% / Actual (i.e., “+” = over-prediction, “-” = under-prediction).
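
The signed relative error metric defined on this slide can be written directly as a small helper; the function name is chosen for this example.

```python
def signed_relative_error(predicted, actual):
    """Signed relative error in percent.

    Positive values mean over-prediction, negative values mean
    under-prediction, as defined on the slide.
    """
    return (predicted - actual) * 100.0 / actual
```

For example, predicting 110 iterations when the actual run takes 100 gives +10%, and predicting 90 gives -10%.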

11 Predicting Features (Iterations). Giraph BSP, 10 machines, real datasets in [1, 25] GB.

12 Predicting Features (Iterations). Predicting iterations for semi-clustering: τ = 0.01 (left) and τ = 0.001 (right).

13 Predicting Features (Iterations) Predicting key features for top-k ranking: Predicting iterations (left), and predicting remote message bytes (right).

14 Predicting Features (Iterations). PageRank, sampling ratio = 0.1: PREDIcT reduces the relative error from [104, 168]% to [0, 11]%.

15 Predicting Time. Semi-clustering and neighborhood estimation: [10, 30]% relative error for a 15% sample. Algorithms with variable work per iteration. Cumulated impact of the number of iterations and the per-iteration resources.

16 Impact. PREDIcT: an experimental methodology for estimating key features and runtime for iterative analytics on graphs. It enables key feature prediction (pluggable transformations) and runtime prediction (cost model). Accurate empirical solution: iterations within [0, 11]% (as opposed to [104, 168]%), time within [10, 30]%. http://dias.epfl.ch/predict Thank you!

17 Backup slides

18 Cost Model: Model Fitting. Training data: the sample run plus historical runs (if such runs exist). Features are drawn from a pool of BSP features and fitted by multivariate regression, yielding a customizable cost model F(X_1,…,X_k) per input algorithm.
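
To make the fitting step concrete, the sketch below fits a multivariate linear model by ordinary least squares in plain Python, solving the normal equations with Gauss-Jordan elimination. The slides only say "multivariate regression"; the choice of ordinary least squares, the function names, and the training values in the usage note are assumptions for illustration.

```python
def fit_linear_model(rows, targets):
    """Least-squares fit of time = w0 + w1*x1 + ... + wk*xk.

    rows holds one feature list per training run, targets the measured
    times. Returns [w0, w1, ..., wk].
    """
    design = [[1.0] + list(r) for r in rows]  # leading 1.0 for the intercept
    k = len(design[0])
    # Augmented normal equations: (X^T X | X^T y).
    aug = [
        [sum(x[i] * x[j] for x in design) for j in range(k)]
        + [sum(x[i] * y for x, y in zip(design, targets))]
        for i in range(k)
    ]
    for col in range(k):
        # Partial pivoting keeps the elimination numerically stable.
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("features are linearly dependent; add more runs")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


def predict(weights, features):
    """Apply a fitted model to one feature vector."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))
```

A call such as `fit_linear_model([[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]], [1, 3, 4, 6, 8])` would use five made-up training runs with two features each; in the setting of the slide, the rows would come from the sample run and any historical runs.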

19 Customized Cost Model for the Bulk Synchronous Parallel Execution Model. Each iteration runs compute, message, and sync phases on the workers (W1, W2, W3). Compute features: active vertices, message counts. Message features: message counts, message sizes, locality of messages. Sync: partitioning scheme / skew. Specialized for network-intensive algorithms. Each phase except sync: multivariate regression. Synchronization is modeled implicitly.

20 Feasibility Analysis. Feasible for algorithms dominated by iteration time.

21 Context: BSP Processing Model (Giraph BSP). Vertex-centric model: each vertex performs local processing, then messaging. Algorithms in Bulk Synchronous Parallel (BSP) are inherently iterative. Each iteration runs compute, message, and sync phases on the workers (W1 to W4).
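
The vertex-centric loop described on this slide can be simulated in a single process. The sketch below propagates the maximum vertex value through the graph, a common teaching example for this model; it does not use the Giraph API, and the function name and data layout are hypothetical.

```python
def run_bsp(out_edges, values, max_supersteps=100):
    """Single-process simulation of vertex-centric BSP supersteps.

    Each vertex adopts the largest value it receives and, if its value
    changed, sends it to its out-neighbors. Returns (values, supersteps).
    """
    values = dict(values)
    # First superstep: every vertex sends its value to its out-neighbors.
    inbox = {v: [] for v in out_edges}
    for v, targets in out_edges.items():
        for t in targets:
            inbox[t].append(values[v])
    supersteps = 1
    while any(inbox.values()) and supersteps < max_supersteps:
        outbox = {v: [] for v in out_edges}
        for v, messages in inbox.items():
            # Compute phase: local processing on the received messages.
            if messages and max(messages) > values[v]:
                values[v] = max(messages)
                # Messaging phase: notify out-neighbors of the new value.
                for t in out_edges[v]:
                    outbox[t].append(values[v])
        # Sync phase: messages become visible only in the next superstep.
        inbox = outbox
        supersteps += 1
    return values, supersteps
```

The loop ends when no messages are in flight, which mirrors how the amount of work per iteration can vary from one superstep to the next.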


