A System for Distributed Graph-Parallel Machine Learning. Joseph Gonzalez, Postdoc, UC Berkeley AMPLab. The Team: Yucheng Low, Aapo Kyrola, Danny Bickson, Alex Smola, Haijie Gu, Carlos Guestrin, Guy Blelloch.

Big Data gives rise to Big Graphs: more signal, but also more noise.

It’s all about the graphs…

Graphs encode relationships between people, facts, products, interests, and ideas, across social media, advertising, science, and the web. Big means billions of vertices and edges and rich metadata.

Graphs are essential to data mining and machine learning: identify influential people and information, find communities, target ads and products, and model complex data dependencies.

Example: estimating the political bias (liberal vs. conservative) of posts. Unlabeled posts are inferred from labeled ones by running Belief Propagation on a Conditional Random Field defined over the graph.

Triangle Counting: for each vertex in the graph, count the number of triangles containing it. This measures both the "popularity" of the vertex and the "cohesiveness" of the vertex's community.

Collaborative Filtering: Exploiting Dependencies. Given a user's ratings of films such as City of God, Wild Strawberries, The Celebration, La Dolce Vita, and Women on the Verge of a Nervous Breakdown, what do I recommend?

Latent Topic Modeling (LDA): discover topics from word co-occurrences (example words: Cat, Apple, Growth, Hat, Plant).

Matrix Factorization with Alternating Least Squares (ALS): the Netflix ratings matrix (Users x Movies) is approximated by the product of a user-factor matrix U and a movie-factor matrix M, so that a rating r_ij is modeled by the factors f(i) and f(j). Iterate: solve for the factor of each user i given the movie factors, then for the factor of each movie j given the user factors.
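
A standard form of the alternating update, written out as an illustrative sketch (the slide does not spell out the equations; here r_{ij} is a rating, \Omega_i the set of movies rated by user i, \Omega_j the set of users who rated movie j, and \lambda a ridge-regularization parameter):

  u_i \leftarrow \Big(\sum_{j \in \Omega_i} m_j m_j^{\top} + \lambda I\Big)^{-1} \sum_{j \in \Omega_i} r_{ij}\, m_j,
  \qquad
  m_j \leftarrow \Big(\sum_{i \in \Omega_j} u_i u_i^{\top} + \lambda I\Big)^{-1} \sum_{i \in \Omega_j} r_{ij}\, u_i.

Each solve only touches a user's (or movie's) neighborhood in the bipartite ratings graph, which is what makes ALS a natural graph-parallel algorithm.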

PageRank: everyone starts with an equal rank; ranks are updated in parallel, with the rank of user i computed as a weighted sum of the neighbors' ranks; iterate until convergence.
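
One common damped form of the update, consistent with the full PageRank code later in these slides (which uses a 0.15/0.85 split); the pseudocode on the following slides omits the damping term for brevity:

  R[i] = 0.15 + 0.85 \sum_{j \in \mathrm{Nbrs}(i)} w_{ji}\, R[j],

where w_{ji} is typically 1/\mathrm{outdegree}(j).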

How should we program graph-parallel algorithms? Low-level tools like MPI and Pthreads? - Me, during my first years of grad school

Threads, Locks, and MPI: ML experts (i.e., graduate students) repeatedly solve the same parallel design challenges: implement and debug a complex parallel system, tune it for a single parallel platform, and six months later the conference paper contains "We implemented ______ in parallel." The resulting code is difficult to maintain and extend, and couples the learning model to the implementation.

How should we program graph-parallel algorithms? High-level abstractions! - Me, now

The Graph-Parallel Pattern: a user-defined Vertex-Program runs on each vertex, and the graph constrains interaction along edges – using messages (e.g., Pregel [PODC'09, SIGMOD'10]) or through shared state (e.g., GraphLab [UAI'10, VLDB'12, OSDI'12]). Parallelism: run multiple vertex programs simultaneously. "Think like a Vertex." - Malewicz et al. [SIGMOD'10]

Graph-parallel abstractions: synchronous messaging (Pregel) versus dynamic, asynchronous shared state (GraphLab). The latter is better suited to machine learning.

The GraphLab Vertex Program: vertex programs directly access adjacent vertices and edges.

GraphLab_PageRank(i)
  // Compute sum over neighbors
  total = 0
  foreach( j in neighbors(i) ):
    total = total + R[j] * w_ji
  // Update the PageRank
  R[i] = total
  // Trigger neighbors to run again
  if R[i] not converged then
    signal nbrsOf(i) to be recomputed

Signaled vertices are recomputed eventually.

GraphLab Asynchronous Execution: a scheduler determines the order in which vertices are executed across CPUs, and it can prioritize vertices.

Benefit of Dynamic PageRank: 51% of vertices are updated only once.

Asynchronous Belief Propagation on a synthetic noisy image: the map of cumulative vertex updates shows many updates along region boundaries of the graphical model and few elsewhere; the algorithm identifies and focuses on this hidden sequential structure. Challenge = boundaries.

GraphLab Ensures a Serializable Execution, which is needed for Gauss-Seidel, Gibbs Sampling, Graph Coloring, and similar algorithms: vertex programs that share a conflict edge are not executed simultaneously.

Never Ending Learner Project (CoEM), language modeling / named entity recognition:
  Hadoop (BSP): 95 cores, 7.5 hrs
  GraphLab: 16 cores, 30 min (15x faster with 6x fewer CPUs)
  Distributed GraphLab: 32 EC2 machines, 80 secs (0.3% of the Hadoop time)

Thus far, GraphLab provided a powerful new abstraction. But we couldn't scale up to the AltaVista web graph from 2002: 1.4B vertices, 6.6B edges.

Natural Graphs: graphs derived from natural phenomena.

Properties of Natural Graphs: unlike a regular mesh, a natural graph has a power-law degree distribution.

Power-Law Degree Distribution (AltaVista WebGraph, 1.4B vertices, 6.6B edges): the top 1% of vertices are adjacent to 50% of the edges, while more than 10^8 vertices have only one neighbor; on a log-log plot of number of vertices versus degree, the slope gives α ≈ 2.
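
The slope refers to the power-law form of the degree distribution, written here for reference (the exponent is read off the plot):

  P(\text{degree} = d) \propto d^{-\alpha}, \qquad \alpha \approx 2.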

Power-Law Degree Distribution: high-degree vertices form a "star like" motif, e.g., President Obama and his followers.

Challenges of High-Degree Vertices: a single vertex program touches a large fraction of the graph, its edges are processed sequentially on one CPU, and such graphs are provably difficult to partition.

Random Partitioning: GraphLab resorts to random (hashed) partitioning on natural graphs. With 10 machines, 90% of edges are cut; with 100 machines, 99% of edges are cut!
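
These percentages follow directly from hashing each vertex independently and uniformly onto one of p machines (a minimal sketch of the calculation): an edge is cut exactly when its two endpoints land on different machines, so

  \mathbb{E}[\text{fraction of edges cut}] = 1 - \frac{1}{p},

which gives 0.9 for p = 10 and 0.99 for p = 100.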

Split High-Degree Vertices across machines: the new abstraction provides an equivalence on split vertices, so you program for the original graph but run on the split one.

A Common Pattern for Vertex-Programs: gather information about the neighborhood, update the vertex, then signal neighbors and modify edge data.

GraphLab_PageRank(i)
  // Compute sum over neighbors
  total = 0
  foreach( j in neighbors(i) ):
    total = total + R[j] * w_ji
  // Update the PageRank
  R[i] = total
  // Trigger neighbors to run again
  priority = |R[i] - oldR[i]|
  if R[i] not converged then
    signal neighbors(i) with priority

Pattern → GAS Decomposition. Gather (Reduce): accumulate information about the neighborhood with a user-defined Gather and a parallel sum (Σ1 + Σ2 → Σ3). Apply: apply the accumulated value Σ to the center vertex with a user-defined Apply(Y, Σ) → Y'. Scatter: with a user-defined Scatter, update adjacent edges and vertices (update edge data and activate neighbors).

Pattern → Abstraction: Gather(SrcV, Edge, DstV) → A – collect information from a neighbor; Sum(A, A) → A – commutative, associative sum; Apply(V, A) → V – update the vertex; Scatter(SrcV, Edge, DstV) → (Edge, signal) – update edges and signal neighbors.
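
A minimal C++ sketch of this interface, for illustration only (the type and method names are hypothetical and this is not the actual GraphLab/PowerGraph API, which appears later in the slides):

  // V = vertex data, E = edge data, A = accumulator type.
  template <typename V, typename E, typename A>
  struct gas_vertex_program {
    virtual A    gather(const V& src, const E& edge, const V& dst) const = 0; // per-neighbor contribution
    virtual A    sum(const A& left, const A& right) const = 0;                // commutative, associative combine
    virtual void apply(V& vertex, const A& total) = 0;                        // update the center vertex
    virtual bool scatter(const V& src, E& edge, const V& dst) const = 0;      // update edge; signal neighbor?
    virtual ~gas_vertex_program() = default;
  };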

PageRank in GraphLab2 (GAS form):

GraphLab2_PageRank(i)
  Gather( j → i )  : return w_ji * R[j]
  Sum(a, b)        : return a + b
  Apply(i, Σ)      : R[i] = Σ
  Scatter( i → j ) : if R[i] changed then trigger j to be recomputed

Distributed GAS Decomposition: a split vertex has one master and several mirrors. Each machine gathers a partial sum Σ1 … Σ4 locally, the partial sums are combined into Σ at the master, apply produces the new value Y', Y' is sent back to the mirrors, and scatter runs locally on each machine.

Minimizing Communication in PowerGraph: a vertex-cut minimizes the number of machines each vertex spans, and communication is linear in the number of machines each vertex spans. Percolation theory suggests that power-law graphs have good vertex cuts [Albert et al. 2000]. New theorem: for any edge-cut we can directly construct a vertex-cut which requires strictly less communication and storage.
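
In symbols (restating the slide's claim, with A(v) denoting the set of machines spanned by vertex v):

  \text{communication and storage} \;\propto\; \sum_{v \in V} |A(v)|,

so minimizing the average replication factor \frac{1}{|V|}\sum_{v} |A(v)| minimizes communication.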

Constructing Vertex-Cuts: evenly assign edges to machines – minimize the number of machines spanned by each vertex; assign each edge as it is loaded – touch each edge only once. We propose two distributed approaches: Random Vertex Cut and Greedy Vertex Cut.

Random Vertex-Cut: randomly assign edges to machines. This yields a balanced vertex-cut; in the example, vertex Y spans 3 machines and vertex Z spans 2 machines, and no edge is cut.

Analysis of random edge placement: the expected number of machines spanned by a vertex can be computed in closed form, which lets us accurately estimate memory and communication overhead on the Twitter follower graph (41 million vertices, 1.4 billion edges).
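
The closed form, as derived in the PowerGraph analysis (D[v] is the degree of v and p the number of machines, each edge placed uniformly at random):

  \mathbb{E}\big[|A(v)|\big] = p\Big(1 - \big(1 - \tfrac{1}{p}\big)^{D[v]}\Big),

since a given machine receives none of v's D[v] edges with probability (1 - 1/p)^{D[v]}.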

Random Vertex-Cuts vs. Edge-Cuts: the expected improvement from vertex-cuts is an order of magnitude.

Streaming Greedy Vertex-Cuts: place each edge on a machine that already holds the vertices of that edge.
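
A minimal single-pass sketch of this greedy heuristic, under simplifying assumptions (tie-breaking by machine load is one common choice; the data structures are illustrative and this is not PowerGraph's exact implementation):

  #include <cstdint>
  #include <set>
  #include <unordered_map>
  #include <vector>

  // placed[v] = set of machines that already hold a replica of vertex v.
  // load[m]   = number of edges assigned to machine m so far.
  int place_edge(uint64_t src, uint64_t dst, std::vector<size_t>& load,
                 std::unordered_map<uint64_t, std::set<int>>& placed) {
    const std::set<int>& a = placed[src];
    const std::set<int>& b = placed[dst];
    std::set<int> both;                          // machines holding both endpoints
    for (int m : a) if (b.count(m)) both.insert(m);

    auto least_loaded = [&](const std::set<int>& cands) {
      int best = *cands.begin();
      for (int m : cands) if (load[m] < load[best]) best = m;
      return best;
    };

    int machine;
    if (!both.empty()) {                         // both endpoints already co-located somewhere
      machine = least_loaded(both);
    } else if (!a.empty() || !b.empty()) {       // at least one endpoint already placed
      std::set<int> either(a);
      either.insert(b.begin(), b.end());
      machine = least_loaded(either);
    } else {                                     // neither endpoint seen yet: least-loaded machine
      machine = 0;
      for (size_t m = 1; m < load.size(); ++m) if (load[m] < load[machine]) machine = m;
    }
    placed[src].insert(machine);
    placed[dst].insert(machine);
    ++load[machine];
    return machine;
  }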

Greedy Vertex-Cuts Improve Performance: greedy partitioning improves computation performance.

System Design: implemented as a C++ API; uses HDFS for graph input and output; fault tolerance is achieved by checkpointing (snapshot time < 5 seconds for the Twitter network). The PowerGraph (GraphLab2) system runs on EC2 HPC nodes over MPI/TCP-IP, PThreads, and HDFS.

Implemented Many Algorithms.
Collaborative Filtering: Alternating Least Squares, Stochastic Gradient Descent.
Statistical Inference: Loopy Belief Propagation, Max-Product Linear Programs, Gibbs Sampling.
Graph Analytics: PageRank, Triangle Counting, Shortest Path, Graph Coloring, K-core Decomposition.
Computer Vision: Image Stitching.
Language Modeling: LDA.

PageRank on the Twitter Follower Graph (a natural graph with 40M users, 1.4 billion links) on 32 nodes x 8 cores (EC2 HPC cc1.4x): PowerGraph both reduces total network communication (GB) and runs faster (seconds).

PageRank on the Twitter Follower Graph (40M users, 1.4 billion links) versus other systems – Hadoop results from [Kang et al. '11], Twister (in-memory MapReduce) from [Ekanayake et al. '10]: an order-of-magnitude improvement by exploiting the properties of natural graphs.

GraphLab2 is Scalable. Yahoo AltaVista Web Graph (2002), one of the largest publicly available web graphs: 1.4 billion webpages, 6.6 billion links. On 64 HPC nodes (1024 cores, 2048 hyperthreads): 7 seconds per iteration, 1B links processed per second, 30 lines of user code.

Topic Modeling on English-language Wikipedia: 2.6M documents, 8.3M words, 500M tokens; a computationally intensive algorithm. Compared against Yahoo! machines specifically engineered for this task, GraphLab used 64 cc2.8xlarge EC2 nodes, 200 lines of code, and 4 human hours.

Triangle Counting on the Twitter Graph: identify individuals with strong communities. Counted: 34.8 billion triangles. Hadoop [WWW'11]: 1536 machines, 423 minutes. PowerGraph: 64 machines, 1.5 minutes, i.e., roughly 282x faster (423 / 1.5). Why is Hadoop so slow? The wrong abstraction: it broadcasts O(degree^2) messages per vertex. Reference: S. Suri and S. Vassilvitskii, "Counting triangles and the curse of the last reducer," WWW'11.

The GraphLab2 system (EC2 HPC nodes, MPI/TCP-IP, PThreads, HDFS) ships with machine learning and data-mining toolkits: graph analytics, graphical models, computer vision, clustering, topic modeling, and collaborative filtering. Apache 2 License.

GraphChi: Going small with GraphLab Solve huge problems on small or embedded devices? Key: Exploit non-volatile memory (starting with SSDs and HDs)

GraphChi – disk-based GraphLab: a novel Parallel Sliding Windows algorithm; single-machine, parallel, asynchronous execution; solves big problems that are normally solved in the cloud; efficiently exploits disks – optimized for streaming access and efficient on both SSDs and hard drives.

Triangle Counting on the Twitter Graph (40M users, 1.2B edges); total: 34.8 billion triangles. Hadoop (results from [Suri & Vassilvitskii '11]): 1536 machines, 423 minutes. PowerGraph: 64 machines (1024 cores), 1.5 minutes. GraphChi: 59 minutes on one Mac Mini!

Orders-of-magnitude improvements over existing systems, new ways to execute graph algorithms, and new ways to represent real-world graphs.

Ongoing Work

GraphX: the GraphLab abstraction and representation with dataflow execution on Spark – scalable, fault-tolerant, and interactive graph computation.

Efficiently Encoding Graphs as RDDs: the graph is stored as an edge table (e.g., AB, AC, AD, AE, ED) plus a vertex table (A, B, C, D, E) with replication, which is equivalent to the PowerGraph representation.

Gather-Apply-Scatter as Joins: the edge table (Src, Dst, Value, Partition) is joined with the vertex table (Id, Value, Partition, Active) on Src = Id and on Dst = Id; gather is an aggregation over the join, and apply writes the result back into the vertex table. Communication is efficient (equivalent to PowerGraph: edge data is always local) and the access pattern is disk friendly (edges are touched sequentially, vertices randomly).
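
A toy, single-machine C++ sketch of the gather step expressed as a join between an edge table and a vertex table (the table layouts, names, and the weighted-sum aggregation are illustrative assumptions, not the GraphX implementation):

  #include <cstdint>
  #include <unordered_map>
  #include <vector>

  struct Edge { uint64_t src, dst; double weight; };  // edge table row
  struct VRow { double value; bool active; };         // vertex table row

  // "Gather as a join": for every edge whose source vertex is active,
  // join in the source's value and accumulate it at the destination
  // (here a weighted sum, as in a PageRank-style gather).
  std::unordered_map<uint64_t, double>
  gather_join(const std::vector<Edge>& edges,
              const std::unordered_map<uint64_t, VRow>& vertices) {
    std::unordered_map<uint64_t, double> accum;  // Dst -> gathered sum
    for (const Edge& e : edges) {                // edges scanned sequentially
      auto src = vertices.find(e.src);           // vertices probed randomly
      if (src != vertices.end() && src->second.active)
        accum[e.dst] += src->second.value * e.weight;
    }
    return accum;                                // fed to the apply step
  }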

Joining Tables and Graphs simplifies ETL and graph-structured analytics: user data and product ratings (tables) are joined with the friend graph (graph) through an ETL step to produce a product-recommendation graph, and inference on that graph produces product recommendations.

Easier to Program.

GraphLab (C++):

#include <graphlab.hpp>

struct vertex_data : public graphlab::IS_POD_TYPE {
  float rank;
  vertex_data() : rank(1) { }
};
typedef graphlab::empty edge_data;
typedef graphlab::distributed_graph<vertex_data, edge_data> graph_type;

class pagerank :
    public graphlab::ivertex_program<graph_type, float>,
    public graphlab::IS_POD_TYPE {
  float last_change;
 public:
  float gather(icontext_type& context, const vertex_type& vertex,
               edge_type& edge) const {
    return edge.source().data().rank / edge.source().num_out_edges();
  }
  void apply(icontext_type& context, vertex_type& vertex,
             const gather_type& total) {
    const double newval = 0.15 + 0.85 * total;
    last_change = std::fabs(newval - vertex.data().rank);
    vertex.data().rank = newval;
  }
  void scatter(icontext_type& context, const vertex_type& vertex,
               edge_type& edge) const {
    if (last_change > TOLERANCE) context.signal(edge.target());
  }
};

struct pagerank_writer {
  std::string save_vertex(graph_type::vertex_type v) {
    std::stringstream strm;
    strm << v.id() << "\t" << v.data().rank << "\n";
    return strm.str();
  }
  std::string save_edge(graph_type::edge_type e) { return ""; }
};

int main(int argc, char** argv) {
  graphlab::mpi_tools::init(argc, argv);
  graphlab::distributed_control dc;
  graphlab::command_line_options clopts("PageRank algorithm.");
  graph_type graph(dc, clopts);
  graph.load_format("biggraph.tsv", "tsv");
  graphlab::omni_engine<pagerank> engine(dc, graph, clopts);
  engine.signal_all();
  engine.start();
  graph.save(saveprefix, pagerank_writer(), false, true, false);
  graphlab::mpi_tools::finalize();
  return EXIT_SUCCESS;
}

GraphX on Spark (Scala) – interactive:

import spark.graphlab._
val sc = spark.SparkContext(master, "pagerank")
val graph = Graph.textFile("bigGraph.tsv")
val vertices = graph.outDegree().mapValues((_, 1.0, 1.0))
val pr = Graph(vertices, graph.edges).iterate(
  (meId, e) => e.source.data._2 / e.source.data._1,
  (a: Double, b: Double) => a + b,
  (v, accum) => (v.data._1, (0.15 + 0.85 * accum), v.data._2),
  (meId, e) => abs(e.source.data._2 - e.source.data._1) > 0.01)
pr.vertices.saveAsTextFile("results")

GraphX: Graph Computation on Spark (joint work with Reynold Xin and Mike Franklin). Enables distributed graph computation – collaborative filtering, community detection, PageRank, shortest path, belief propagation, structured clustering, neural networks, Gibbs sampling, … Unifies existing abstractions – GraphLab, Pregel, visitor patterns, …

Next Big Graph Challenge: continuously changing graphs (new vertices, new relationships, changing attributes). Incrementally maintain centrality and predictions, efficiently and in real time: materialized view maintenance for graph algorithms.

Thank you very much! Joseph Gonzalez, Postdoc, UC Berkeley AMPLab.