Big Graph Processing on Cloud Jeffrey Xu Yu ( 于旭 ) The Chinese University of Hong Kong

yu@se.cuhk.edu.hk, http://www.se.cuhk.edu.hk/~yu

Big Graphs/Networks

Graph Systems
Many graph systems have been reported in the literature.

Graph Computing on Cloud
- Workload Balancing
- Auto Approximation

Vertex-Centric Computing on BSP
- Distributed vertex-centric computing
- BSP (Bulk Synchronous Parallel)
  - Concurrent computing
  - Communication
  - Barrier synchronization
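The three BSP ingredients listed above can be illustrated with a minimal single-process sketch. It is not the API of any particular system: the names `bsp_run`, `compute`, and `max_value_program` are hypothetical, and the "concurrent" phase is simulated by a plain loop. Messages produced in one superstep become visible only in the next, which plays the role of the barrier.

```python
from collections import defaultdict

def bsp_run(graph, init, compute, max_supersteps=50):
    """Run a vertex-centric program in BSP style (sequential simulation).

    graph:   dict vertex -> list of out-neighbors
    init:    dict vertex -> initial value
    compute: f(vertex, value, messages, superstep)
             -> (new_value, [(target, message), ...], vote_to_halt)
    """
    values = dict(init)
    active = set(graph)              # every vertex starts active
    inbox = defaultdict(list)
    for step in range(max_supersteps):
        if not active:
            break
        outbox = defaultdict(list)
        next_active = set()
        # Computing phase: each active vertex runs the user-defined function.
        for v in active:
            new_value, outgoing, halt = compute(v, values[v], inbox[v], step)
            values[v] = new_value
            # Communication phase: messages are buffered, not delivered yet.
            for target, msg in outgoing:
                outbox[target].append(msg)
            if not halt:
                next_active.add(v)
        # Barrier: buffered messages become visible only in the next superstep,
        # and a vertex that receives a message is reactivated.
        active = next_active | set(outbox)
        inbox = outbox
    return values

def max_value_program(graph):
    """Example UDF: propagate the maximum value to every vertex."""
    def compute(v, value, messages, step):
        new_value = max([value] + messages)
        changed = step == 0 or new_value > value
        outgoing = [(u, new_value) for u in graph[v]] if changed else []
        return new_value, outgoing, True   # always vote to halt
    return compute

graph = {1: [2], 2: [1, 3], 3: [2]}
result = bsp_run(graph, {1: 3, 2: 6, 3: 1}, max_value_program(graph))
```

In this toy run every vertex ends with the value 6, and the set of active vertices shrinks from superstep to superstep, which is the behavior the later slides on workload balancing are concerned with.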

Workload Balancing
- Computing
  - Determined by the slowest
  - Workload balancing
- Communication
  - The volume matters
  - Cross-edges
- Computing + Communication
  - Balanced partitioning

Workload Balancing
- Balanced k-way graph partitioning
  - Size-balanced partitions
  - The minimum possible number of cross-edges
- It solves our problem if the graph is static.
  - By static, we mean the vertices are always active during the computation.
- However, for graph analytics, the vertices may toggle between active and inactive.
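The two objectives of balanced k-way partitioning can be made concrete with a small measurement sketch. The function name `partition_quality` and the imbalance definition (largest partition size divided by the average size) are assumptions chosen for illustration, not definitions taken from the slides.

```python
from collections import Counter

def partition_quality(edges, part, k):
    """Measure a k-way partition.

    edges: iterable of (u, v) pairs
    part:  dict vertex -> partition id in range(k)
    Returns (number of cross-edges, size imbalance), where imbalance is the
    largest partition size divided by the average size (1.0 = perfectly balanced).
    """
    sizes = Counter(part.values())
    cross_edges = sum(1 for u, v in edges if part[u] != part[v])
    average = len(part) / k
    imbalance = max(sizes.values()) / average
    return cross_edges, imbalance

# A path 0-1-2-3-4-5 split into two halves versus split by parity.
edges = [(i, i + 1) for i in range(5)]
halves = {v: 0 if v < 3 else 1 for v in range(6)}
parity = {v: v % 2 for v in range(6)}
quality_halves = partition_quality(edges, halves, 2)
quality_parity = partition_quality(edges, parity, 2)
```

Both partitions are perfectly size-balanced, but the split into halves cuts one edge while the parity split cuts all five, so only the first keeps the communication volume low.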

Dynamic Workload Balancing
- Computing
  - Determined by the slowest
  - Workload balancing
- Communication
  - The volume matters
  - Cross-edges
- Dynamic workload balancing
  - Respond to the vertices' active/inactive status

Any General Approach?
- We do not know anything about what graph algorithms will be used.
- We do not know anything about the graphs themselves.
- We cannot request graphs to be 'well' partitioned on Cloud.
- We cannot assume how graphs are initially partitioned on Cloud.
- It needs to react to workload imbalance in a timely manner, and it cannot take long to rebalance itself.

An Example

Representative Graph Algorithms
- PageRank
- Semi-clustering
- Graph Coloring
- Single Source Shortest Path
- Breadth First Search
- Random Walk
- Maximal Matching
- Minimum Spanning Tree
- Maximal Independent Sets

Category 1: Always Active
- The three algorithms
  - PageRank
  - Semi-clustering
  - Graph Coloring
- The vertices are always active
- Ideal case for static partitioning
  - Perfectly balanced, as expected

Category 2: Traversal
- The three algorithms
  - Single Source Shortest Path
  - Breadth First Search
  - Random Walk
- Significantly imbalanced
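Why traversal algorithms become imbalanced under a static partition can be seen in a small sketch that counts the active (frontier) vertices on each worker in every BFS superstep. The function name `bfs_active_per_worker` and the toy path graph are assumptions for illustration only.

```python
from collections import Counter

def bfs_active_per_worker(graph, source, part, k):
    """Return, for each BFS superstep, the number of active (frontier)
    vertices held by each of the k workers.

    graph: dict vertex -> list of neighbors
    part:  dict vertex -> worker id in range(k)
    """
    visited = {source}
    frontier = [source]
    history = []
    while frontier:
        counts = Counter(part[v] for v in frontier)
        history.append([counts.get(w, 0) for w in range(k)])
        next_frontier = []
        for v in frontier:
            for u in graph[v]:
                if u not in visited:
                    visited.add(u)
                    next_frontier.append(u)
        frontier = next_frontier
    return history

# Undirected path 0-1-2-3-4-5; vertices 0..2 on worker 0, 3..5 on worker 1.
graph = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}
part = {v: 0 if v < 3 else 1 for v in range(6)}
history = bfs_active_per_worker(graph, 0, part, 2)
```

The partition is size-balanced and has a single cross-edge, yet in every superstep one of the two workers has no active vertex at all. This is the kind of imbalance a static partition cannot address.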

Category 3: Multi-Phases
- The three algorithms
  - Maximal Matching
  - Minimum Spanning Tree
  - Maximal Independent Sets
- Somewhat balanced

Predictable?
- For Category 1, the algorithms have a stable working window.
- For Category 2, predictability cannot be ensured, but most large-scale graphs have the low-diameter property.
  - SSSP has a reasonable hit-rate between supersteps.
- For Category 3, the hit-rate between two successive phases is very high, due to the algorithm design.

Our Approach [Shang et al. ICDE’13]

Some Basic Ideas

Compare with Random Partitioning

Graph Computing on Cloud
- The factors
  - Memory consumption, communication cost, CPU cost, and the number of rounds.
- The classes
  - MapReduce Class (MRC) by Karloff et al. in SODA'10.
  - Minimal MapReduce Class (MMC) by Tao et al. in SIGMOD'13.
  - Scalable Graph Processing (SGC) on MapReduce by Qin et al. in SIGMOD'14.
  - Balanced Practical Pregel Algorithms (BPPA) on BSP by Yan et al. in VLDB'14.

Auto-Approximate Graph Computing [Shang et al. VLDB'15]
- Big data and bigger data
  - Google: 2+ EB
  - Twitter: hit 8 PB
  - Yahoo: 400 PB
  - Facebook: 300 PB
- Big data needs to get answers fast
- More data beats a cleverer algorithm
  - "A few useful things to know about machine learning" by P. Domingos in CACM 2012.

Why Auto-Approximate?
- Working in a distributed environment is hard
- Designing a new algorithm is hard
- A new distributed approximate algorithm?
  - Hard + hard
- The target is a fast answer!
- But it is impossible to know the meaning of programs.

Auto-Approximate Graph Computing
- To modify the vertex-centric programs (UDF)
[Figure panels: Traditional Computing, Approximation Computing]

The Errors
[Figure labels: init value, default UDF, approx. UDF, final results, error term]

The Errors
- The error comes from two sides
  - The "bad" input: error inherited from previous iterations
  - Wrong calculation: error from the new approx. UDF
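One way to write the two error sources down is the following decomposition. The notation is an assumption introduced here, not taken from the slides: $f$ is the default UDF, $\tilde{f}$ the approximate UDF, $x_t$ the exact state after iteration $t$, and $\tilde{x}_t$ the state of the approximate run.

```latex
\begin{align*}
e_{t+1} &= \tilde{f}(\tilde{x}_t) - f(x_t) \\
        &= \underbrace{f(\tilde{x}_t) - f(x_t)}_{\text{``bad'' input: inherited error}}
         + \underbrace{\tilde{f}(\tilde{x}_t) - f(\tilde{x}_t)}_{\text{wrong calculation: approx. UDF}}
\end{align*}
```

The identity holds by adding and subtracting $f(\tilde{x}_t)$. The first term vanishes when the previous iterations were exact, and the second vanishes when the approximate UDF agrees with the default one on the current input.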

Approximation
- No single approach can approximate all problems, as restricted by Rice's theorem.
  - Any nontrivial property of the language recognized by a Turing machine is undecidable.
- Approximation
  - Continuous functions
  - Discrete functions
- The notions of continuity from mathematical analysis are relevant and interesting even for software (Chaudhuri et al. in CACM, 2012).
  - Shortest paths, minimum spanning trees

An Example
- Sampling as an example
  - Find chances of sampling
  - Synthesize codes
  - Correct the answer by regression
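As a rough illustration of what a sampled computation inside a vertex program could look like, the sketch below replaces an exact sum over incoming messages with a sampled and rescaled sum. This is a generic sampling estimator, not the synthesized code or the regression-based correction of the approach presented here; the name `sampled_sum` and the sampling rate `p` are hypothetical.

```python
import random

def sampled_sum(messages, p, rng):
    """Estimate sum(messages) by keeping each message independently with
    probability p and rescaling the kept total by 1/p.

    The estimate is unbiased; smaller p means less work but more variance.
    """
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    kept = [m for m in messages if rng.random() < p]
    return sum(kept) / p

rng = random.Random(42)
messages = [1.0] * 100                         # exact sum is 100.0
exact = sampled_sum(messages, 1.0, rng)        # p = 1 keeps every message
trials = [sampled_sum(messages, 0.5, rng) for _ in range(2000)]
mean_estimate = sum(trials) / len(trials)      # close to 100 on average
```

With `p = 0.5` each call touches about half of the messages, and the individual estimates scatter around the exact value. That scatter is the error side of the error-time tradeoff.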

Error-Time Tradeoff

The Sampling Strategies

Graph Algorithms

Real Datasets

PR over twitter-mp (10 iterations)

The Eight Graph Algorithms

Time/Error Prediction

Some Remarks
- Many graph systems have been reported in the literature.
- New ideas still need to be explored to deal with big graphs.
