Danny Bickson: Parallel Machine Learning for Large-Scale Graphs

Presentation transcript:

Parallel Machine Learning for Large-Scale Graphs. Danny Bickson. The GraphLab Team: Joe Hellerstein, Alex Smola, Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Jay Gu, Carlos Guestrin.

Parallelism is Difficult. There is a wide array of parallel architectures (GPUs, multicore, clusters, clouds, supercomputers), and each poses different challenges. High-level abstractions make things easier.

How will we design and implement parallel learning systems?

... a popular answer: Map-Reduce / Hadoop. Build learning algorithms on top of high-level parallel abstractions.

Map-Reduce for Data-Parallel ML: excellent for large data-parallel tasks!
- Data-Parallel (Map Reduce): Feature Extraction, Cross Validation, Computing Sufficient Statistics.
- Graph-Parallel: Belief Propagation, Label Propagation, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, PageRank, Lasso.

Example of Graph Parallelism

PageRank Example. Iterate:

    R[i] = \alpha + (1 - \alpha) \sum_{j \to i} \frac{R[j]}{L[j]}

where α is the random reset probability and L[j] is the number of links on page j. [Figure: a small example graph with pages 1 through 6.]

Properties of Graph-Parallel Algorithms: a dependency graph, local updates, and iterative computation. [Figure: my rank depends on my friends' ranks.]

Addressing Graph-Parallel ML: we need alternatives to Map-Reduce.
- Data-Parallel (Map Reduce): Feature Extraction, Cross Validation, Computing Sufficient Statistics.
- Graph-Parallel (Pregel/Giraph? Map Reduce?): Belief Propagation, SVM, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, PageRank, Lasso.

Pregel (Giraph) follows the Bulk Synchronous Parallel model: compute, communicate, barrier.

PageRank in Giraph (Pregel):

    bsp_page_rank() {
      // Sum PageRank over incoming messages
      sum = 0
      forall (message in in_messages())
        sum = sum + message
      rank = ALPHA + (1 - ALPHA) * sum
      set_vertex_value(rank)
      // Send new messages to neighbors or terminate
      if (current_super_step() < MAX_STEPS) {
        nedges = num_out_edges()
        forall (neighbor in out_neighbors())
          send_message(rank / nedges)
      } else
        vote_to_halt()
    }

Problem: bulk synchronous computation can be highly inefficient.

BSP Systems Problem: the Curse of the Slow Job. [Figure: iterations separated by barriers; in every iteration, all CPUs wait at the barrier for the slowest one to finish.]

The Need for a New Abstraction: if not Pregel, then what?
- Data-Parallel (Map Reduce): Feature Extraction, Cross Validation, Computing Sufficient Statistics.
- Graph-Parallel (Pregel/Giraph): Belief Propagation, SVM, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, PageRank, Lasso.

The GraphLab Solution. Designed specifically for ML needs: expresses data dependencies and iterative computation. Simplifies the design of parallel programs: abstracts away hardware issues and automates data synchronization. Addresses multiple hardware architectures: multicore, distributed / cloud computing, with a GPU implementation in progress.

What is GraphLab?

The GraphLab Framework: graph-based data representation, update functions (user computation), scheduler, and consistency model.

Data Graph: a graph with arbitrary data (C++ objects) associated with each vertex and edge. Example graph: a social network. Vertex data: user profile text, current interest estimates. Edge data: similarity weights.
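As a concrete illustration, here is a minimal sketch of how such a data graph might be declared. The type and method names (graphlab::graph, add_vertex, add_edge, the graphlab.hpp header) follow the spirit of the GraphLab C++ API used later in this talk, but the exact signatures are assumptions and may differ between GraphLab versions.

    #include <string>
    #include <vector>
    #include <graphlab.hpp>   // assumed entry-point header; name may differ by version

    // Arbitrary C++ objects stored on each vertex and edge.
    struct vertex_data {
      std::string profile_text;        // user profile text
      std::vector<double> interests;   // current interest estimates
    };

    struct edge_data {
      double similarity;               // similarity weight between two users
    };

    typedef graphlab::graph<vertex_data, edge_data> graph_type;

    void build_toy_social_network(graph_type& g) {
      vertex_data alice; alice.profile_text = "likes hiking";
      vertex_data bob;   bob.profile_text   = "likes climbing";
      g.add_vertex(alice);             // assumed to receive vertex id 0
      g.add_vertex(bob);               // assumed to receive vertex id 1
      edge_data e; e.similarity = 0.8;
      g.add_edge(0, 1, e);             // directed edge with similarity weight
    }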

Update Functions. An update function is a user-defined program which, when applied to a vertex, transforms the data in the scope of the vertex:

    pagerank(i, scope) {
      // Get neighborhood data (R[i], W_ij, R[j]) from the scope
      // Update the vertex data
      // Reschedule neighbors if needed (dynamic computation)
      if R[i] changes then reschedule_neighbors_of(i)
    }

PageRank in GraphLab (pseudocode):

    GraphLab_pagerank(scope) {
      sum = 0
      forall (nbr in scope.in_neighbors())
        sum = sum + nbr.value() / nbr.num_out_edges()
      old_rank = scope.vertex_data()
      scope.center_value() = ALPHA + (1 - ALPHA) * sum
      residual = abs(scope.center_value() - old_rank)
      if (residual > EPSILON)
        reschedule_out_neighbors()
    }

PageRank in GraphLab2 (actual GraphLab2 code):

    struct pagerank : public iupdate_functor<graph, pagerank> {
      void operator()(icontext_type& context) {
        vertex_data& vdata = context.vertex_data();
        double sum = 0;
        foreach (edge_type edge, context.in_edges())
          sum += 1.0 / context.num_out_edges(edge.source()) *
                 context.vertex_data(edge.source()).rank;
        double old_rank = vdata.rank;
        vdata.rank = RESET_PROB + (1 - RESET_PROB) * sum;
        double residual = abs(vdata.rank - old_rank) / context.num_out_edges();
        if (residual > EPSILON)
          context.reschedule_out_neighbors(pagerank());
      }
    };

The Scheduler. The scheduler determines the order in which vertices are updated: CPUs pull vertices to update from the scheduler, and update functions can push new vertices onto it. The process repeats until the scheduler is empty. [Figure: CPU 1 and CPU 2 pulling vertices a, b, c, ... from a shared scheduler queue.]
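To make the control flow concrete, here is a small self-contained sketch of a FIFO scheduler loop. This is my own illustration, not GraphLab's actual engine: vertices are popped from a queue, a toy update function is applied, and neighbors are pushed back only when the value changed noticeably, until the queue drains.

    #include <cmath>
    #include <cstdio>
    #include <deque>
    #include <vector>

    struct Graph {
      std::vector<std::vector<int> > out_nbrs;  // out-neighbor lists
      std::vector<double> value;                // one value per vertex
    };

    int main() {
      Graph g;
      g.out_nbrs = {{1, 2}, {2}, {0}};
      g.value    = {1.0, 1.0, 1.0};

      std::deque<int> scheduler = {0, 1, 2};        // initially schedule every vertex
      std::vector<bool> queued  = {true, true, true};

      while (!scheduler.empty()) {                  // repeat until the scheduler is empty
        int v = scheduler.front();
        scheduler.pop_front();
        queued[v] = false;

        double old_value = g.value[v];
        g.value[v] *= 0.5;                          // stand-in for a user update function

        if (std::fabs(g.value[v] - old_value) > 1e-3)   // dynamic computation:
          for (int nbr : g.out_nbrs[v])                 // reschedule neighbors only on change
            if (!queued[nbr]) { scheduler.push_back(nbr); queued[nbr] = true; }
      }
      std::printf("final values: %f %f %f\n", g.value[0], g.value[1], g.value[2]);
      return 0;
    }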

The GraphLab Framework: graph-based data representation, update functions (user computation), scheduler, and consistency model.

Ensuring Race-Free Code How much can computation overlap?

Need for Consistency? No consistency means higher throughput (#updates/sec), but potentially slower convergence of the ML algorithm.

[Plot: ALS on the full Netflix data, 8 cores, inconsistent vs. consistent updates. Inconsistent updates give bad intermediate results on highly connected movies.]

Even Simple PageRank Can Be Dangerous:

    GraphLab_pagerank(scope) {
      ref sum = scope.center_value
      sum = 0
      forall (neighbor in scope.in_neighbors)
        sum = sum + neighbor.value / neighbor.num_out_edges
      sum = ALPHA + (1 - ALPHA) * sum
      ...

[Plot: inconsistent PageRank, 8 cores, relative to the point of convergence.]

Even Simple PageRank Can Be Dangerous:

    GraphLab_pagerank(scope) {
      ref sum = scope.center_value
      sum = 0
      forall (neighbor in scope.in_neighbors)
        sum = sum + neighbor.value / neighbor.num_out_edges
      sum = ALPHA + (1 - ALPHA) * sum
      ...

Read-write race: CPU 1 reads a bad PageRank estimate while CPU 2 is still computing the value.

Race Conditions Can Be Very Subtle.

Unstable version (sum aliases the center value, so neighbors can observe partial sums):

    GraphLab_pagerank(scope) {
      ref sum = scope.center_value
      sum = 0
      forall (neighbor in scope.in_neighbors)
        sum = sum + neighbor.value / neighbor.num_out_edges
      sum = ALPHA + (1 - ALPHA) * sum
      ...

Stable version (accumulate into a local variable, then write the center value once):

    GraphLab_pagerank(scope) {
      sum = 0
      forall (neighbor in scope.in_neighbors)
        sum = sum + neighbor.value / neighbor.num_out_edges
      sum = ALPHA + (1 - ALPHA) * sum
      scope.center_value = sum
      ...

This was actually encountered in user code.

GraphLab Ensures Sequential Consistency: for each parallel execution, there exists a sequential execution of update functions which produces the same result. [Figure: a parallel execution on CPU 1 and CPU 2 over time versus an equivalent sequential execution on a single CPU.]

Consistency Rules. [Figure: full-consistency scopes over the data graph.] Guaranteed sequential consistency for all update functions.

Full Consistency. [Figure: full-consistency scopes over the data graph.]

Obtaining More Parallelism: full consistency vs. edge consistency. [Figure: edge-consistency scopes are smaller, allowing more simultaneous updates.]

Edge Consistency. [Figure: CPU 1 and CPU 2 updating nearby vertices; the overlapping read of the shared neighbor is safe.]
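For intuition only, here is a small standalone sketch (my own, not GraphLab's implementation) of how an edge-consistent scope could be enforced with locks: take an exclusive lock on the center vertex and shared read locks on its neighbors, always acquiring locks in increasing vertex-id order so concurrent scopes cannot deadlock. It uses C++17 std::shared_mutex.

    #include <algorithm>
    #include <cstddef>
    #include <shared_mutex>
    #include <vector>

    struct LockedGraph {
      std::vector<std::vector<int> > nbrs;    // adjacency lists
      std::vector<std::shared_mutex> locks;   // one reader-writer lock per vertex
      explicit LockedGraph(std::size_t n) : nbrs(n), locks(n) {}
    };

    // Run `update` inside an edge-consistent scope around vertex v: exclusive
    // access to v (and, by convention, its adjacent edges), shared read access
    // to its neighbors. Locks are taken in increasing vertex-id order.
    template <typename UpdateFn>
    void run_edge_consistent(LockedGraph& g, int v, UpdateFn update) {
      std::vector<int> order = g.nbrs[v];
      order.push_back(v);
      std::sort(order.begin(), order.end());
      for (int u : order) {
        if (u == v) g.locks[u].lock();          // write lock on the center vertex
        else        g.locks[u].lock_shared();   // read lock on each neighbor
      }
      update(v);                                // user update runs inside the scope
      for (int u : order) {
        if (u == v) g.locks[u].unlock();
        else        g.locks[u].unlock_shared();
      }
    }

    int main() {
      LockedGraph g(3);
      g.nbrs[0] = {1, 2}; g.nbrs[1] = {0}; g.nbrs[2] = {0};
      std::vector<double> rank(3, 1.0);
      run_edge_consistent(g, 0, [&](int v) {    // neighbor reads are protected
        rank[v] = 0.15 + 0.85 * (rank[1] + rank[2]);
      });
      return 0;
    }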

The GraphLab Framework: graph-based data representation, update functions (user computation), scheduler, and consistency model.

What algorithms are implemented in GraphLab?

Alternating Least Squares, SVD, Splash Sampler, CoEM, Bayesian Tensor Factorization, Lasso, Belief Propagation, PageRank, LDA, SVM, Gibbs Sampling, Dynamic Block Gibbs Sampling, K-Means, Matrix Factorization, Linear Solvers, ...many others...

GraphLab Libraries:
- Matrix factorization: SVD, PMF, BPTF, ALS, NMF, Sparse ALS, Weighted ALS, SVD++, time-SVD++, SGD
- Linear solvers: Jacobi, GaBP, Shotgun Lasso, sparse logistic regression, CG
- Clustering: K-means, Fuzzy K-means, LDA, K-core decomposition
- Inference: discrete BP, NBP, kernel BP

Efficient Multicore Collaborative Filtering. LeBuSiShu team, 5th place in Track 1. Yao Wu, Qiang Yan, Qing Yang (Institute of Automation, Chinese Academy of Sciences); Danny Bickson, Yucheng Low (Machine Learning Dept., Carnegie Mellon University). ACM KDD CUP Workshop 2011.

ACM KDD CUP 2011. Task: predict music ratings. Two main challenges: data magnitude (260M ratings) and the taxonomy of the data.

Data taxonomy

Our approach: use an ensemble method, with a custom SGD algorithm for handling the taxonomy.

Ensemble method: solutions are merged using linear regression.
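For reference, a standard least-squares blend of this kind (my own formulation of "merged using linear regression", not quoted from the talk) fits weights w over the individual predictors' validation outputs:

    w^\star = \arg\min_w \, \lVert X w - y \rVert_2^2 = (X^\top X)^{-1} X^\top y

where column k of X holds predictor k's predictions on the validation set and y holds the true ratings; test-time predictions are then formed as X_test w^\star.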

Performance results. Blended validation RMSE: 19.90. [Plot: RMSE (root mean square error) of the different methods; note that time-MFITR performs very well, second only to time-SVD++.]

Classical Matrix Factorization. [Figure: a sparse users-by-items rating matrix factorized into user feature vectors p_u and item feature vectors q_i of dimension d.]

MFITR. [Figure: the sparse users-by-items matrix again; the "effective feature of an item" combines item-specific features with features of the item's album and artist.]

Speaker notes: MFITR is our novel method for coping with the characteristics of the KDD data, namely the hierarchy of track, album, artist, and genre. It is composed of two elements; this slide discusses the first. r_ui is the scalar predicted rating between user u and item i, given by a linear prediction rule. mu is the model mean; b_i, b_u, b_a are the item, user, and artist biases, learned from the data; q_i, q_a, p_u are feature vectors, also learned from the data. In addition to the linear matrix factorization model, which factors the model into user and item feature vectors, we add an additional feature: the artist feature q_a.
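Putting the notes above into a formula, one plausible reading (my reconstruction, not copied from the paper, and it may omit the album term shown in the figure) is:

    \hat r_{ui} = \mu + b_u + b_i + b_a + p_u^\top (q_i + q_a)

with μ the model mean, b_u, b_i, b_a the user, item, and artist biases, p_u the user feature vector, and q_i, q_a the item and artist feature vectors.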

Intuitively, the features of an artist and the features of his/her albums should be "similar". How do we express this? We add penalty terms which ensure that artist, album, and track features stay "close"; the strength of each penalty depends on a "normalized rating similarity" (see the neighborhood model).
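As a sketch of what such a penalty could look like (one plausible form, assumed for illustration rather than quoted from the paper), the regularizer ties each track to its album and each album to its artist, weighted by the normalized rating similarity s:

    \sum_{t} s_{t,\mathrm{al}(t)} \lVert q_t - q_{\mathrm{al}(t)} \rVert^2 + \sum_{al} s_{al,\mathrm{ar}(al)} \lVert q_{al} - q_{\mathrm{ar}(al)} \rVert^2

where al(t) denotes the album of track t and ar(al) the artist of album al.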

Fine-Tuning Challenge. The dataset has around 260M observed ratings; we used 12 different algorithms with a total of 53 tunable parameters. How do we train and cross-validate all these parameters? Use GraphLab!

16-Core Runtime. [Plot: runtime of the different algorithms.] While SGD is very fast, it has worse speedup relative to ALS.

Speedup plots. Alternating-least-squares-style algorithms scale very well: once one subset of nodes (users or movies) is fixed, all the nodes on the other side (movies or users) can be updated in parallel. SGD and SVD++ scale less well, since whenever two users have rated the same movie they need to update that movie's feature vector at the same time.

Who is using GraphLab?

Universities using GraphLab

Companies trying out GraphLab; startups using GraphLab. 2400+ unique downloads tracked (possibly many more from direct repository checkouts).

User community

Performance results

GraphLab vs. Pregel (BSP): multicore PageRank (25M vertices, 355M edges). [Plot: GraphLab vs. Pregel emulated via GraphLab; with dynamic scheduling, 51% of the vertices are updated only once.]

CoEM (Rosie Jones, 2005): a named entity recognition task. Is "dog" an animal? Is "Catalina" a place? [Figure: a bipartite graph between noun phrases ("the dog", "Australia", "Catalina Island") and contexts ("<X> ran quickly", "travelled to <X>", "<X> is pleasant").] Graph size: 2 million vertices, 200 million edges. Hadoop: 95 cores, 7.5 hrs.

CoEM results: Hadoop, 95 cores: 7.5 hrs. GraphLab, 16 cores: 30 min. 15x faster with 6x fewer CPUs!

GraphLab in the Cloud

CoEM (Rosie Jones, 2005) in the cloud: Hadoop, 95 cores: 7.5 hrs. GraphLab, 16 cores: 30 min. GraphLab in the Cloud, 32 EC2 machines: 80 secs, 0.3% of the Hadoop time.

Cost-Time Tradeoff (video co-segmentation results): a few machines help a lot, then returns diminish; more machines, higher cost.

Netflix Collaborative Filtering: alternating least squares matrix factorization. Model: 0.5 million nodes, 99 million edges (Netflix users and movies). [Plot: scaling versus the ideal for latent dimensions D=20 and D=100, comparing Hadoop, MPI, and GraphLab.]

Multicore Abstraction Comparison: Netflix matrix factorization. Dynamic computation yields faster convergence.

The Cost of Hadoop

Fault Tolerance

Fault Tolerance. Larger problems mean an increased chance of machine failure. GraphLab2 introduces two fault-tolerance (checkpointing) mechanisms: synchronous snapshots and Chandy-Lamport asynchronous snapshots.

Synchronous Snapshots. [Figure: over time, execution alternates between "Run GraphLab" phases and "Barrier + Snapshot" phases across machines.]

Curse of the slow machine. [Plot: progress with no snapshot vs. with a synchronous snapshot.]

Curse of the Slow Machine. [Figure: one machine reaches the "Barrier + Snapshot" late, so all machines wait.]

Curse of the slow machine. [Plot: no snapshot vs. delayed synchronous snapshot vs. synchronous snapshot.]

Asynchronous Snapshots. The Chandy-Lamport algorithm is implementable as a GraphLab update function! It requires edge consistency.

    struct chandy_lamport {
      void operator()(icontext_type& context) {
        save(context.vertex_data());
        foreach (edge_type edge, context.in_edges()) {
          if (edge.source() was not marked as saved) {
            save(context.edge_data(edge));
            context.schedule(edge.source(), chandy_lamport());
          }
        }
        ... // repeat for context.out_edges
        mark context.vertex() as saved;
      }
    };

Snapshot Performance. [Plot: asynchronous snapshot vs. no snapshot vs. synchronous snapshot.]

Snapshot with 15s fault injection: halt 1 out of 16 machines for 15s. [Plot: progress with synchronous snapshot, no snapshot, and asynchronous snapshot.]

New challenges

Natural Graphs Have Power-Law Degree Distributions. Yahoo! Web Graph: 1.4B vertices, 6.7B edges. The top 1% of vertices is adjacent to 53% of the edges!
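For context (a standard characterization, not a figure from the talk), a power-law degree distribution means the probability that a vertex has degree d falls off polynomially,

    P(\mathrm{degree} = d) \propto d^{-\alpha}

with a small exponent α (commonly reported to be around 2 for web and social graphs), so a handful of very-high-degree vertices account for a large fraction of the edges.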

Problem: High-Degree Vertices. High-degree vertices limit parallelism: they require heavy locking, touch a large amount of state, and are processed sequentially.

High Communication in Distributed Updates. When a vertex update spans machines, data from each neighbor is transmitted separately across the network. [Figure: vertex Y with its gather and scatter split between Machine 1 and Machine 2.]

High-Degree Vertices Are Common: popular movies (Netflix users-movies graph), "social" people, hyper-parameters, and common words such as "Obama" (docs-words graph in LDA). [Figure: LDA plate diagram with variables θ, Z, w, B, α.]

Two Core Changes to the Abstraction: (1) factorized update functors: monolithic updates become decomposed Gather + Apply + Scatter phases; (2) delta update functors: monolithic updates become composable update "messages", f1 and f2 composing to (f1 o f2).

PageRank in GraphLab2, annotated with its natural phases:

    struct pagerank : public iupdate_functor<graph, pagerank> {
      void operator()(icontext_type& context) {
        vertex_data& vdata = context.vertex_data();
        // Parallel "sum" gather
        double sum = 0;
        foreach (edge_type edge, context.in_edges())
          sum += 1.0 / context.num_out_edges(edge.source()) *
                 context.vertex_data(edge.source()).rank;
        // Atomic single-vertex apply
        double old_rank = vdata.rank;
        vdata.rank = RESET_PROB + (1 - RESET_PROB) * sum;
        double residual = abs(vdata.rank - old_rank) / context.num_out_edges();
        // Parallel scatter [reschedule]
        if (residual > EPSILON)
          context.reschedule_out_neighbors(pagerank());
      }
    };

Decomposable Update Functors. Locks are acquired only for the region within a scope, giving relaxed consistency. The user defines three pieces:
- Gather(edge) -> Δ, with partial results combined by a parallel sum, Δ1 + Δ2 -> Δ3;
- Apply(vertex, Δ), which applies the accumulated value to the center vertex;
- Scatter(edge), which updates adjacent edges and vertices.

Factorized PageRank (pseudocode):

    double gather(scope, edge) {
      return edge.source().value().rank / scope.num_out_edge(edge.source())
    }
    double merge(acc1, acc2) {
      return acc1 + acc2
    }
    void apply(scope, accum) {
      old_value = scope.center_value().rank
      scope.center_value().rank = ALPHA + (1 - ALPHA) * accum
      scope.center_value().residual = abs(scope.center_value().rank - old_value)
    }
    void scatter(scope, edge) {
      if (scope.center_vertex().residual > EPSILON)
        schedule(edge.target())
    }

Factorized Updates: Significant Decrease in Communication. By splitting the gather and scatter across machines, only a small amount of data is transmitted over the network. [Figure: partial gathers F1 and F2 computed on each machine and then combined for vertex Y.]

Factorized Consistency. Neighboring vertices may be updated simultaneously: [Figure: vertices A and B both in the gather phase at the same time.]

Factorized Consistency Locking. A gather on an edge cannot occur during the apply: [Figure: vertex B gathers from its other neighbors while A is performing its apply.]

Factorized PageRank (GraphLab2 code):

    struct pagerank : public iupdate_functor<graph, pagerank> {
      double accum = 0, residual = 0;
      void gather(icontext_type& context, const edge_type& edge) {
        accum += 1.0 / context.num_out_edges(edge.source()) *
                 context.vertex_data(edge.source()).rank;
      }
      void merge(const pagerank& other) { accum += other.accum; }
      void apply(icontext_type& context) {
        vertex_data& vdata = context.vertex_data();
        double old_value = vdata.rank;
        vdata.rank = RESET_PROB + (1 - RESET_PROB) * accum;
        residual = fabs(vdata.rank - old_value) / context.num_out_edges();
      }
      void scatter(icontext_type& context, const edge_type& edge) {
        if (residual > EPSILON)
          context.schedule(edge.target(), pagerank());
      }
    };

Decomposable Loopy Belief Propagation. Gather: accumulates the product of incoming messages. Apply: updates the central belief. Scatter: computes outgoing messages and schedules adjacent vertices.
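To spell out the three phases in standard sum-product notation (my annotation; these are the textbook equations rather than formulas taken from the slides): the gather accumulates the product of incoming messages, the apply forms the belief, and the scatter produces the outgoing messages.

    \Delta_i(x_i) = \prod_{k \in N(i)} m_{k \to i}(x_i)                                   (gather / merge)
    b_i(x_i) \propto \phi_i(x_i) \, \Delta_i(x_i)                                          (apply)
    m_{i \to j}(x_j) \propto \sum_{x_i} \psi_{ij}(x_i, x_j) \, \phi_i(x_i) \prod_{k \in N(i) \setminus \{j\}} m_{k \to i}(x_i)   (scatter)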

Decomposable Alternating Least Squares (ALS). [Figure: the Netflix users-movies bipartite graph; ratings are approximated by user factors W and movie factors X, y_ij ≈ w_i^T x_j.] Update function: Gather: sum terms. Apply: matrix inversion and multiply.
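Concretely, the gathered "sum terms" and the "matrix inversion & multiply" apply correspond to the standard regularized least-squares update (the ridge term λI is an assumption, not shown on the slide):

    w_i \leftarrow \Big( \sum_{j \in N(i)} x_j x_j^\top + \lambda I \Big)^{-1} \sum_{j \in N(i)} y_{ij} \, x_j

where the gather sums the terms x_j x_j^T and y_ij x_j over the movies rated by user i, and the apply inverts the accumulated matrix and multiplies.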

Decomposable Functors fit many algorithms (loopy belief propagation, label propagation, PageRank, ...) and address the earlier concerns: large state becomes distributed gather and scatter; heavy locking becomes fine-grained locking; sequential processing becomes parallel gather and scatter.

Comparison of Abstractions. [Plot: GraphLab1 vs. factorized updates on multicore PageRank (25M vertices, 355M edges).]

Need for Vertex-Level Asynchrony: a single change to one neighbor forces a costly re-gather over all neighbors of Y. Delta updates exploit the fact that the "sum" is commutative and associative: the change Δ is simply added to the old (cached) sum instead of re-gathering everything.

Delta Update (pseudocode). The program starts with schedule_all(ALPHA):

    void update(scope, delta) {
      scope.center_value() = scope.center_value() + delta
      if (abs(delta) > EPSILON) {
        // each out-neighbor receives its share of the change
        out_delta = delta * (1 - ALPHA) / scope.num_out_edges()
        reschedule_out_neighbors(out_delta)
      }
    }
    double merge(delta, delta) { return delta + delta }
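The factor in out_delta follows directly from the PageRank recurrence used throughout this talk: since

    R[i] = \alpha + (1 - \alpha) \sum_{j \to i} \frac{R[j]}{L[j]}

a change of Δ in R[j] changes each of j's out-neighbors by (1 - α) Δ / L[j], which is exactly the delta that gets forwarded (with L[j] taken to be the out-degree of the updating vertex, as in the code above).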

Scheduling Composes Updates. Calling reschedule_out_neighbors forces update-function composition: [Figure: reschedule_out_neighbors(pagerank(3)) merges pagerank(3) into a pending pagerank(7), leaving a pending pagerank(10).]

Multicore Abstraction Comparison. [Plot: multicore PageRank (25M vertices, 355M edges).]

Distributed Abstraction Comparison. [Plot: GraphLab1 vs. GraphLab2 (delta updates) on distributed PageRank (25M vertices, 355M edges).]

PageRank on the AltaVista Webgraph 2002: 1.4B vertices, 6.7B edges. Hadoop: 800 cores. Prototype GraphLab2: 512 cores, 431 s, with known inefficiencies; a further 2x gain is possible.

Summary of GraphLab2. Decomposed update functions (Gather + Apply + Scatter) expose parallelism in high-degree vertices; delta update functions expose asynchrony in high-degree vertices.

Lessons Learned.
Machine learning:
- Asynchronous execution is often much faster than synchronous.
- Dynamic computation is often faster, but it can be difficult to define optimal thresholds.
- Consistency can improve performance and is sometimes required for convergence, though there are cases where relaxed consistency is sufficient.
- High-degree vertices and natural graphs can limit parallelism; further assumptions on update functions are needed.
Systems:
- Distributed asynchronous systems are harder to build, but having no distributed barriers means better scalability and performance.
- Scaling up by an order of magnitude requires rethinking design assumptions, e.g., the distributed graph representation. Science to do!

Summary. An abstraction tailored to machine learning: targets graph-parallel algorithms; naturally expresses data/computational dependencies and dynamic iterative computation; simplifies parallel algorithm design; automatically ensures data consistency; and achieves state-of-the-art parallel performance on a variety of problems.

Parallel GraphLab 1.1 (multicore) is available today; GraphLab2 (in the cloud) is coming soon. Documentation, code, and tutorials: http://graphlab.org