Big Graph Processing on Cloud Jeffrey Xu Yu ( 于旭 ) The Chinese University of Hong Kong


1 Big Graph Processing on Cloud Jeffrey Xu Yu ( 于旭 ) The Chinese University of Hong Kong yu@se.cuhk.edu.hk, http://www.se.cuhk.edu.hk/~yu

2 Big Graphs/Networks

3 Graph Systems There are many graph systems in the literature.

4 Graph Computing on Cloud Workload Balancing Auto Approximation

5 Vertex-Centric Computing on BSP Distributed vertex-centric computing. BSP (Bulk Synchronous Parallel): concurrent computing; communication; barrier synchronization.
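The three BSP ingredients on this slide can be sketched in a few lines of Python. This is a minimal single-process illustration, not the system described in the talk: the function name `bsp_max_value`, the dict-based graph format, and the choice of "propagate the maximum value" as the vertex program are all assumptions made for the example.

```python
from collections import defaultdict

def bsp_max_value(graph, values, max_supersteps=50):
    """Propagate the maximum value through a graph, Pregel-style.

    graph:  dict vertex -> list of neighbour vertices (hypothetical format)
    values: dict vertex -> initial numeric value
    """
    values = dict(values)
    active = set(graph)            # every vertex starts active
    inbox = defaultdict(list)      # messages delivered at the last barrier
    superstep = 0
    while active and superstep < max_supersteps:
        outbox = defaultdict(list)
        # 1. Concurrent computing: each active vertex runs the same function
        #    and reads only its own value and its own inbox.
        for v in active:
            new_value = max([values[v]] + inbox[v])
            changed = new_value > values[v]
            values[v] = new_value
            # 2. Communication: send only at the start or when the value changed.
            if changed or superstep == 0:
                for u in graph[v]:
                    outbox[u].append(new_value)
        # 3. Barrier synchronization: messages become visible only now.
        inbox = outbox
        active = set(outbox)       # a vertex that received mail is (re)activated
        superstep += 1
    return values, superstep
```

A vertex that receives no message becomes inactive, and the loop ends when no vertex is active, which is the usual vote-to-halt convention in vertex-centric systems.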

6 Workload Balancing Computing: determined by the slowest; workload balancing. Communication: the volume matters; cross-edges. Computing + Communication: balanced partitioning.

7 Workload Balancing Balanced k-way graph partitioning: size-balanced partitions; the minimum possible number of cross-edges. It solves our problem if the graph is static. By static, we mean the vertices are always active during the computation. However, for graph analytics, the vertices may toggle between active and inactive.
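The two objectives of balanced k-way partitioning can be measured directly on a given partition. The sketch below only evaluates a partition; it does not compute one. The helper name `partition_quality` and the definition of imbalance as "largest partition size divided by the ideal size" are assumptions chosen for illustration.

```python
from collections import Counter

def partition_quality(edges, part_of, k):
    """Return (cross_edges, imbalance) for a k-way partition.

    edges:   iterable of (u, v) pairs
    part_of: dict vertex -> partition id in range(k)
    imbalance = largest partition size / ideal size (1.0 is perfect)
    """
    sizes = Counter(part_of.values())
    # An edge is a cross-edge when its endpoints live in different partitions.
    cross_edges = sum(1 for u, v in edges if part_of[u] != part_of[v])
    ideal = len(part_of) / k
    imbalance = max(sizes[p] for p in range(k)) / ideal
    return cross_edges, imbalance
```

On a 4-cycle split into two halves, keeping adjacent vertices together cuts 2 edges, while alternating the vertices cuts all 4, although both partitions are perfectly size-balanced.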

8 Dynamic Workload Balancing Computing: determined by the slowest; workload balancing. Communication: the volume matters; cross-edges. Dynamic workload balancing: respond to the vertices’ active/inactive status.
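Why a size-balanced static partition can still be imbalanced at run time is easy to see with a traversal. The following sketch counts, for each BFS superstep, how many active (frontier) vertices each worker holds. The function name `bfs_active_per_worker` and the input format are hypothetical, and using the active-vertex count as a proxy for a worker's computing cost is an assumption.

```python
def bfs_active_per_worker(graph, part_of, k, source):
    """For each BFS superstep, count active (frontier) vertices per worker.

    graph:   dict vertex -> list of neighbours
    part_of: dict vertex -> worker id in range(k)
    Returns a list of per-superstep load vectors of length k.
    """
    visited = {source}
    frontier = [source]
    history = []
    while frontier:
        load = [0] * k
        for v in frontier:
            load[part_of[v]] += 1   # only frontier vertices do work
        history.append(load)
        next_frontier = []
        for v in frontier:
            for u in graph[v]:
                if u not in visited:
                    visited.add(u)
                    next_frontier.append(u)
        frontier = next_frontier
    return history
```

On a path of four vertices split evenly between two workers, every superstep keeps exactly one worker busy and the other idle, even though the partition is size-balanced and has a single cross-edge.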

9 Any General Approach? We do not know anything about what graph algorithms will be used. We do not know anything about the graphs themselves. We cannot request graphs to be ‘well’ partitioned on Cloud. We cannot assume how graphs are initially partitioned on Cloud. The system needs to react to workload imbalance promptly, and rebalancing itself cannot take long.

10 An Example

11 Representative Graph Algorithms PageRank, Semi-clustering, Graph Coloring, Single Source Shortest Path, Breadth First Search, Random Walk, Maximal Matching, Minimum Spanning Tree, Maximal Independent Sets

12 Category 1: Always Active The three algorithms: PageRank; Semi-clustering; Graph Coloring. The vertices are always active. Ideal case for static partition: perfectly balanced, as expected.

13 Category 2: Traversal The three algorithms: Single Source Shortest Path; Breadth First Search; Random Walk. Significantly imbalanced.

14 Category 3: Multi-Phases The three algorithms: Maximal Matching; Minimum Spanning Tree; Maximal Independent Sets. Somewhat balanced.

15 Predictable? For Category 1, the algorithms have a stable working window. For Category 2, predictability cannot be ensured; however, most large-scale graphs have the low-diameter property, and SSSP has a reasonable hit-rate between supersteps. For Category 3, the hit-rate between two successive phases is very high, due to the algorithm design.

16 Our Approach [Shang et al. ICDE’13]

17 Some Basic Ideas

18 Compare with Random Partitioning

19 Graph Computing on Cloud The factors: memory consumption, communication cost, CPU cost, and the number of rounds. The classes: MapReduce Class (MRC) by Karloff et al. in SODA’10; Minimal MapReduce Class (MMC) by Tao et al. in SIGMOD’13; Scalable Graph Processing (SGC) on MapReduce by Qin et al. in SIGMOD’14; Balanced Practical Pregel Algorithms (BPPA) on BSP by Yan et al. in VLDB’14.

20 Auto-Approximate Graph Computing [Shang et al. VLDB’15] Big data and bigger data: Google: 2+ EB; Twitter: hit 8 PB; Yahoo: 400 PB; Facebook: 300 PB. Big data needs to get answers fast. More data beats a cleverer algorithm (A few useful things to know about machine learning by P. Domingos in CACM 2012).

21 Why Auto-Approximate? Working in a distributed environment is hard. Designing a new algorithm is hard. A new distributed approximate algorithm? Hard + hard. The target is a fast answer! But it is impossible to know the meaning of programs.

22 Auto-Approximate Graph Computing To modify the vertex-centric programs (UDF). [Figure labels: Traditional Computing, Approximation Computing]

23 The Errors [Figure labels: init value, default UDF, approx. UDF, final results, error term]

24 The Errors The error comes from two sides. The “bad” input: error inherited from previous iterations. Wrong calculation: error from the new approx. UDF.
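One way to make the two error sources concrete is to split the error of a single iteration into two terms. The sketch below is an illustration under assumptions, not the formulation used in the talk: `f` stands for the default UDF, `f_approx` for the approximate UDF, and `x_exact` / `x_approx` for the values each version holds before the iteration; all four names are hypothetical.

```python
def error_split(f, f_approx, x_exact, x_approx):
    """Split the error after one iteration into its two sources.

    inherited:  the "bad" input, i.e. the exact UDF applied to an input
                that already carries error from previous iterations
    introduced: the wrong calculation, i.e. the approximate UDF versus
                the exact UDF on the same (approximate) input
    """
    inherited = f(x_approx) - f(x_exact)
    introduced = f_approx(x_approx) - f(x_approx)
    total = f_approx(x_approx) - f(x_exact)
    return inherited, introduced, total
```

The middle term `f(x_approx)` cancels, so the two parts always add up to the total error; which part dominates depends on how sensitive `f` is to its input and on how coarse `f_approx` is.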

25 Approximation No single approach can approximate all problems, as restricted by Rice’s theorem: any nontrivial property about the language recognized by a Turing machine is undecidable. Approximation: continuous functions; discrete functions. The notions of continuity from mathematical analysis are relevant and interesting even for software, by Chaudhuri et al. in CACM, 2012: shortest paths, minimum spanning trees.

26

27 An Example Sampling as an example: find chances of sampling; synthesize code; correct the answer by regression.
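The sampling idea can be sketched for the common case where a vertex program sums its incoming messages. This is a hand-written illustration under assumptions, not the code the system synthesizes: the function name `sampled_sum` is hypothetical, and the correction used here is a plain 1/rate scaling rather than the regression-based correction named on the slide, whose details the slide does not give.

```python
import random

def sampled_sum(messages, rate, rng):
    """Estimate sum(messages) from a random sample.

    Each message is kept independently with probability `rate`; the partial
    sum is scaled by 1/rate so that the estimate is unbiased.
    """
    if not 0 < rate <= 1:
        raise ValueError("rate must be in (0, 1]")
    kept = [m for m in messages if rng.random() < rate]
    return sum(kept) / rate

# Usage: a fixed seed keeps the run reproducible.
rng = random.Random(42)
estimate = sampled_sum([1.0] * 10_000, rate=0.5, rng=rng)
```

A lower rate means fewer messages are processed per superstep at the price of a noisier estimate, which is the error-time tradeoff the following slides refer to.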

28 Error-Time Tradeoff

29 The Sampling Strategies

30 Graph Algorithms

31 Real Datasets

32 PR over twitter-mp (10 iterations)

33

34 The Eight Graph Algorithms

35 Time/Error Prediction

36 Some Remarks There are many reported graph systems in the literature. New ideas still need to be explored to deal with big graphs.

