Pregelix: Think Like a Vertex, Scale Like Spandex

Slides:

Advertisements

Similar presentations

Object-Orientation Meets Big Data Language Techniques towards Highly- Efficient Data-Intensive Computing Harry Xu UC Irvine.

Advertisements

Piccolo: Building fast distributed programs with partitioned tables Russell Power Jinyang Li New York University.

epiC: an Extensible and Scalable System for Processing Big Data

Make Sense of Big Data Researched by JIANG Wen-rui Led by Pro. ZOU

1 TDD: Topics in Distributed Databases Distributed Query Processing MapReduce Vertex-centric models for querying graphs Distributed query evaluation by.

Building a Distributed Full-Text Index for the Web S. Melnik, S. Raghavan, B.Yang, H. Garcia-Molina.

Armend Hoxha Trevor Hodde Kexin Shi Mizan: A system for Dynamic Load Balancing in Large-Scale Graph Processing Presented by:

Distributed Graph Processing Abhishek Verma CS425.

Spark: Cluster Computing with Working Sets

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica Spark Fast, Interactive,

APACHE GIRAPH ON YARN Chuan Lei and Mohammad Islam.

IMapReduce: A Distributed Computing Framework for Iterative Computation Yanfeng Zhang, Northeastern University, China Qixin Gao, Northeastern University,

Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy Lin and Michael Schatz University of Maryland Tuesday, June 29, 2010 This work is licensed.

CS 425 / ECE 428 Distributed Systems Fall 2014 Indranil Gupta (Indy) Lecture 22: Stream Processing, Graph Processing All slides © IG.

SYSTEMS SUPPORT FOR GRAPHICAL LEARNING Ken Birman 1 CS6410 Fall /18/2014.

Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy Lin and Michael Schatz University of Maryland MLG, January, 2014 Jaehwan Lee.

Pregel: A System for Large-Scale Graph Processing

Hyracks: A new partitioned-parallel platform for data-intensive computation Vinayak Borkar UC Irvine (joint work with M. Carey, R. Grover, N. Onose, and.

Ex-MATE: Data-Intensive Computing with Large Reduction Objects and Its Application to Graph Mining Wei Jiang and Gagan Agrawal.

Presented By HaeJoon Lee Yanyan Shen, Beng Chin Ooi, Bogdan Marius Tudor National University of Singapore Wei Lu Renmin University Cang Chen Zhejiang University.

Pregel: A System for Large-Scale Graph Processing Presented by Dylan Davis Authors: Grzegorz Malewicz, Matthew H. Austern, Aart J.C. Bik, James C. Dehnert,

X-Stream: Edge-Centric Graph Processing using Streaming Partitions

Ashwani Roy Understanding Graphical Execution Plans Level 200.

CSE 486/586 CSE 486/586 Distributed Systems Graph Processing Steve Ko Computer Sciences and Engineering University at Buffalo.

Pregel: A System for Large-Scale Graph Processing Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and.

Harp: Collective Communication on Hadoop Bingjing Zhang, Yang Ruan, Judy Qiu.

Big Data Open Source Software and Projects ABDS in Summary XVIII: Layer 14A Data Science Curriculum March Geoffrey Fox

A Map-Reduce System with an Alternate API for Multi-Core Environments Wei Jiang, Vignesh T. Ravi and Gagan Agrawal.

Building a Distributed Full-Text Index for the Web by Sergey Melnik, Sriram Raghavan, Beverly Yang and Hector Garcia-Molina from Stanford University Presented.

Pregel Algorithms for Graph Connectivity Problems with Performance Guarantees Da Yan (CUHK), James Cheng (CUHK), Kai Xing (HKUST), Yi Lu (CUHK), Wilfred.

Resilient Distributed Datasets: A Fault- Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,

Data Structures and Algorithms in Parallel Computing Lecture 4.

Data Structures and Algorithms in Parallel Computing

Pregel: A System for Large-Scale Graph Processing Nov 25 th 2013 Database Lab. Wonseok Choi.

Acknowledgement: Arijit Khan, Sameh Elnikety. Google: > 1 trillion indexed pages Web GraphSocial Network Facebook: > 1.5 billion active users 31 billion.

REX: RECURSIVE, DELTA-BASED DATA-CENTRIC COMPUTATION Yavuz MESTER Svilen R. Mihaylov, Zachary G. Ives, Sudipto Guha University of Pennsylvania.

1 Tree and Graph Processing On Hadoop Ted Malaska.

Department of Computer Science, Johns Hopkins University Pregel: BSP and Message Passing for Graph Computations EN Randal Burns 14 November 2013.

Jimmy Lin and Michael Schatz Design Patterns for Efficient Graph Algorithms in MapReduce Michele Iovino Facoltà di Ingegneria dell’Informazione, Informatica.

EpiC: an Extensible and Scalable System for Processing Big Data Dawei Jiang, Gang Chen, Beng Chin Ooi, Kian Lee Tan, Sai Wu School of Computing, National.

Resilient Distributed Datasets A Fault-Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,

Presented by: Omar Alqahtani Fall 2016

TensorFlow– A system for large-scale machine learning

Big Data is a Big Deal!.

Distributed Programming in “Big Data” Systems Pramod Bhatotia wp

Pagerank and Betweenness centrality on Big Taxi Trajectory Graph

Spark Presentation.

Scaling SQL with different approaches

Harry Xu University of California, Irvine & Microsoft Research

Yak: A High-Performance Big-Data-Friendly Garbage Collector

Speculative Region-based Memory Management for Big Data Systems

Apache Spark Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing Aditya Waghaye October 3, 2016 CS848 – University.

SpatialHadoop: A MapReduce Framework for Spatial Data

Introduction to Spark.

Data Structures and Algorithms in Parallel Computing

Towards Effective Partition Management for Large Graphs

Yak: A High-Performance Big-Data-Friendly Garbage Collector

湖南大学-信息科学与工程学院-计算机与科学系

Pregelix: Big(ger) Graph Analytics on A Dataflow Engine

Distributed Systems CS

Cse 344 May 4th – Map/Reduce.

Database Query Execution

CS110: Discussion about Spark

Replication-based Fault-tolerance for Large-scale Graph Processing

Degree-aware Hybrid Graph Traversal on FPGA-HMC Platform

Parallel Applications And Tools For Cloud Computing Environments

Overview of big data tools

DryadInc: Reusing work in large-scale computations

A Map-Reduce System with an Alternate API for Multi-Core Environments

Computational Advertising and

Presentation transcript:

Pregelix: Think Like a Vertex, Scale Like Spandex Yingyi Bu (UC Irvine) Work with: Vinayak Borkar (UC Irvine) , Michael J. Carey (UC Irvine), Tyson Condie (Microsoft & UCLA)

Outline Introduction Programming Model Example Applications System Internals Experimental Results Related Work Conclusions

Introduction Big Graphs are becoming common web graph social network ......

Introduction How Big are Big Graphs? Weapons for mining Big Graphs Web: 8.53 Billion pages in 2012 Facebook active users: 1.01 Billion de Bruijn graph: 3 Billion nodes ...... Weapons for mining Big Graphs Hadoop/Hive (Facebook) Pregel (Google) Distributed GraphLab (CMU)

Programming Model (Pregel) Think like a vertex receive messages update states send messages

Programming Model (BSP) Receive msgs Update states Send msgs Receive msgs an iteration Bulk synchronized A synchronization barrier between iterations

Programming Model - API Vertex (a super class for all applications) public abstract class Vertex<I extends WritableComparable, V extends Writable, E extends Writable, M extends Writable> implements Writable{ /** * @param msgIterator an iterator of incoming messages */ public abstract void compute(Iterator<M> msgIterator); ....... } Helper methods sendMsg(I vertexId, M msg) voteToHalt()

Programming Model - Optional APIs Combiner Combine messages Reduce network traffic Global Aggregator Aggregate statistics over all vertices Done for each iteration Early Termination (not in standard Pregel) Force the job to terminate

Example Applications PageRank ConnectedComponents Shortest Paths Reachability query Start the Demo!

Vertex/map/msg data structures System Internals Pregel GraphLab Giraph ...... Vertex/map/msg data structures Task scheduling Memory management Message delivery Network management Our philosophy Stop building one-off systems like Pregel, GraphLab, and Giraph, instead, building them on a data-flow engine!

System Internals Pregelix dest_idUDAF(combine) UDF (compute) Pregel Semantics Barrier Vertex/map/msg data structures Msg Vertice Task scheduling Task scheduling Record/Index management Memory management Data exchanging Buffer management Message delivery Connection management Network management A general purpose parallel dataflow engine

System Internals - Runtime Runtime Choice? Hyracks Hadoop The UCI Hyracks data-parallel execution engine connection management a set of operators: sorting, grouping, joining task scheduling for jobs (a DAG of operators) index support: B-tree, LSM-Btree, R-tree....

System Internals - Storage Pregelix Job DFS DFS B-tree bulkload Sorting DFS Read B-tree bulkload Sorting DFS Read Sorting B-tree bulkload DFS Read B-tree index scan DFS Write B-tree index scan DFS Write B-tree index scan DFS Write

System Internals - Outer Join Execution Plan dest_idUDAF(combine) dest_idUDAF(combine) dest_idUDAF(combine) dest_idUDAF(combine) dest_idUDAF(combine) dest_idUDAF(combine) Barrier UDF (compute) Barrier Barrier UDF (compute) UDF (compute) Msg Msg Msg Vertice B-tree Vertice B-tree Vertice B-tree

System Internals -Inner Join Execution Plan dest_idUDAF(combine) dest_idUDAF(combine) dest_idUDAF(combine) dest_idUDAF(combine) dest_idUDAF(combine) dest_idUDAF(combine) Barrier UDF (compute) Barrier Barrier UDF (compute) UDF (compute) Live vertex IDs Live vertex IDs Live vertex IDs Vertice B-tree Vertice B-tree Vertice B-tree Msg Msg Msg

System Internals - Implementations Right-outer join Index merging join Sender-side group-by Sort + pre-clustered group-by Data redistribution Hash merging repartitioning connector Sender-side materialization pipelining Receiver-side group-by Pre-clustered group-by Inner join Index probing join Set Union Index set union

System Internals Iteration-aware (sticky) scheduling? Spark, GraphLab, HaLoop all have caches for this kind of iterative jobs. What do you do for caching? Iteration-aware (sticky) scheduling? 1 Loc: location constraints Caching of invariant data? B-tree buffer pool -- 1 Loc: never flush dirty pages File system cache -- free

Experimental Results Setup Machines: Yahoo! Research cluster ~ 180 machines. Each has 8 cores, 12GB memory, 4 disk drives. Dataset: Yahoo! webmap (1,413,511,393 vertice)

Experimental Results 10 iteration PageRank 1x webmap dataset

Experimental Results 10 iteration PageRank 1x webmap on 88 machines, 2x webmap on 175 machines

Related Work Spark [NSDI 2012] HaLoop [VLDB 2010] Giraph Mahout OutOfMemoryError HaLoop [VLDB 2010] Only 1.8X from Hadoop Giraph Mahout Distributed GraphLab [VLDB 2012] Haven't tried yet (just published in September...)

Conclusions Vertex-oriented programming model is simple Dataflow implementation is neat and efficient We target Pregelix to be an open-sourced production system, rather than just a research prototype: http://hyracks.org/projects/pregelix/

Q & A