
1 Parallel Machine Learning for Large-Scale Graphs. The GraphLab Team: Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Jay Gu, Carlos Guestrin, Joe Hellerstein, Alex Smola. Carnegie Mellon University.

2 Parallelism is Difficult. Wide array of different parallel architectures: GPUs, multicore, clusters, clouds, supercomputers. Different challenges for each architecture. High-level abstractions make things easier.

3 How will we design and implement parallel learning systems?

4 A popular answer: Map-Reduce / Hadoop. Build learning algorithms on top of high-level parallel abstractions.

5 Map-Reduce for Data-Parallel ML: excellent for large data-parallel tasks! Data-Parallel (Map Reduce): Cross Validation, Feature Extraction, Computing Sufficient Statistics. Graph-Parallel: Belief Propagation, Label Propagation, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, PageRank, Lasso.

6 Example of Graph Parallelism

7 PageRank Example. Iterate: R[i] = α + (1 − α) Σ_{j links to i} R[j] / L[j], where α is the random reset probability and L[j] is the number of links on page j.
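For concreteness, a minimal serial sketch of this iteration on a toy three-page graph (illustrative only, not GraphLab code; ALPHA, the graph, and the fixed iteration count are assumptions):

#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
  const double ALPHA = 0.15;            // random reset probability
  // Toy graph: out_links[j] lists the pages that page j links to.
  std::vector<std::vector<int>> out_links = {{1, 2}, {2}, {0}};
  const int n = out_links.size();
  std::vector<double> R(n, 1.0), R_new(n);

  for (int iter = 0; iter < 50; ++iter) {
    std::fill(R_new.begin(), R_new.end(), ALPHA);
    for (int j = 0; j < n; ++j)              // each page j ...
      for (int i : out_links[j])             // ... contributes R[j] / L[j] to its targets
        R_new[i] += (1.0 - ALPHA) * R[j] / out_links[j].size();
    R.swap(R_new);
  }
  for (int i = 0; i < n; ++i) std::printf("R[%d] = %.4f\n", i, R[i]);
}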

8 Properties of Graph-Parallel Algorithms: a dependency graph, local updates (my rank depends on my friends' ranks), and iterative computation.

9 Addressing Graph-Parallel ML: we need alternatives to Map-Reduce. Data-Parallel (Map Reduce): Cross Validation, Feature Extraction, Computing Sufficient Statistics. Graph-Parallel (Map Reduce? Pregel/Giraph?): Belief Propagation, SVM, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, PageRank, Lasso.

10 Pregel (Giraph). Bulk Synchronous Parallel model: Compute, Communicate, Barrier.

11 PageRank in Giraph (Pregel)

bsp_page_rank() {
  // Sum PageRank over incoming messages
  sum = 0
  forall (message in in_messages())
    sum = sum + message
  rank = ALPHA + (1 - ALPHA) * sum;
  set_vertex_value(rank);
  // Send new messages to neighbors or terminate
  if (current_super_step() < MAX_STEPS) {
    nedges = num_out_edges()
    forall (neighbor in out_neighbors())
      send_message(rank / nedges);
  } else vote_to_halt();
}

12 Problem: bulk synchronous computation can be highly inefficient.

13 BSP Systems Problem: Curse of the Slow Job. [Figure: in each iteration, every CPU must wait at the barrier for the slowest job before the next round of computation can start.]

14 The Need for a New Abstraction: if not Pregel, then what? Data-Parallel (Map Reduce): Cross Validation, Feature Extraction, Computing Sufficient Statistics. Graph-Parallel (Pregel/Giraph): Belief Propagation, SVM, Kernel Methods, Deep Belief Networks, Neural Networks, Tensor Factorization, PageRank, Lasso.

15 The GraphLab Solution. Designed specifically for ML needs: expresses data dependencies; iterative. Simplifies the design of parallel programs: abstracts away hardware issues; automatic data synchronization. Addresses multiple hardware architectures: multicore, distributed, cloud computing; GPU implementation in progress.

16 What is GraphLab?

17 The GraphLab Framework: Graph-Based Data Representation, Update Functions (User Computation), Scheduler, Consistency Model.

18 Data Graph: a graph with arbitrary data (C++ objects) associated with each vertex and edge. Example — Graph: social network; Vertex data: user profile text, current interests estimates; Edge data: similarity weights.
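A hedged sketch of what these C++ objects might look like (the struct names are illustrative, and the graphlab::graph line is an assumption about the interface, shown only as a comment):

#include <string>
#include <vector>

// Illustrative C++ objects stored on the data graph (names are assumptions).
struct vertex_data {
  std::string profile_text;               // user profile text
  std::vector<double> interest_estimate;  // current interests estimates
};

struct edge_data {
  double similarity;                      // similarity weight between two users
};

// A GraphLab-style data graph would then be parameterized on these types,
// e.g. (sketch only): graphlab::graph<vertex_data, edge_data> social_network;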

19 Update Functions. An update function is a user-defined program which, when applied to a vertex, transforms the data in the scope of the vertex and can trigger dynamic computation by rescheduling neighbors.

pagerank(i, scope) {
  // Get neighborhood data (R[i], w_ij, R[j]) from scope
  // Update the vertex data: R[i] = α + (1 − α) Σ_j w_ji R[j]
  // Reschedule neighbors if needed (dynamic computation)
  if R[i] changes then
    reschedule_neighbors_of(i);
}

20 PageRank in GraphLab

GraphLab_pagerank(scope) {
  sum = 0
  forall (nbr in scope.in_neighbors())
    sum = sum + nbr.value() / nbr.num_out_edges()
  old_rank = scope.vertex_data()
  scope.center_value() = ALPHA + (1 - ALPHA) * sum
  double residual = abs(scope.center_value() - old_rank)
  if (residual > EPSILON)
    reschedule_out_neighbors()
}

21 PageRank in GraphLab2 (actual GraphLab2 code)

struct pagerank : public iupdate_functor {
  void operator()(icontext_type& context) {
    vertex_data& vdata = context.vertex_data();
    double sum = 0;
    foreach (edge_type edge, context.in_edges())
      sum += 1 / context.num_out_edges(edge.source()) *
             context.vertex_data(edge.source()).rank;
    double old_rank = vdata.rank;
    vdata.rank = RESET_PROB + (1 - RESET_PROB) * sum;
    double residual = abs(vdata.rank - old_rank) / context.num_out_edges();
    if (residual > EPSILON)
      context.reschedule_out_neighbors(pagerank());
  }
};

22 The Scheduler. The scheduler determines the order in which vertices are updated: CPU 1 and CPU 2 repeatedly pull scheduled vertices (a, b, c, …) from the scheduler queue, and update functions may push new vertices onto it. The process repeats until the scheduler is empty.
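A hedged sketch of the worker loop implied here (illustrative names only; the real GraphLab scheduler also handles locking, consistency, and multiple scheduling policies):

#include <deque>
#include <functional>

// Illustrative worker loop (assumed names; not the actual GraphLab internals).
using vertex_id = int;

void worker_loop(std::deque<vertex_id>& scheduler,
                 const std::function<void(vertex_id, std::deque<vertex_id>&)>& update) {
  // The process repeats until the scheduler is empty.
  while (!scheduler.empty()) {
    vertex_id v = scheduler.front();   // pick the next scheduled vertex
    scheduler.pop_front();
    update(v, scheduler);              // the update function may reschedule neighbors
  }
}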

23 The GraphLab Framework: Graph-Based Data Representation, Update Functions (User Computation), Scheduler, Consistency Model.

24 Ensuring Race-Free Code How much can computation overlap?

25 Need for Consistency? No consistency gives higher throughput (#updates/sec) but potentially slower convergence of the ML algorithm.

26 [Figure: ALS on Netflix data, 8 cores — consistent vs. inconsistent execution.]

27 Even Simple PageRank can be Dangerous

GraphLab_pagerank(scope) {
  ref sum = scope.center_value
  sum = 0
  forall (nbr in scope.in_neighbors)
    sum = sum + nbr.value / nbr.num_out_edges
  sum = ALPHA + (1 - ALPHA) * sum
  ...

28 Inconsistent PageRank

29 Point of Convergence

30 Even Simple PageRank can be Dangerous

GraphLab_pagerank(scope) {
  ref sum = scope.center_value   // sum is a reference to the publicly visible vertex value
  sum = 0
  forall (nbr in scope.in_neighbors)
    sum = sum + nbr.value / nbr.num_out_edges
  sum = ALPHA + (1 - ALPHA) * sum
  ...

Read-write race: CPU 1 reads a bad PageRank estimate while CPU 2 is still computing the value.

31 Race Condition Can Be Very Subtle

Unstable version (writes through a reference to the public vertex value while summing):

GraphLab_pagerank(scope) {
  ref sum = scope.center_value
  sum = 0
  forall (nbr in scope.in_neighbors)
    sum = sum + nbr.value / nbr.num_out_edges
  sum = ALPHA + (1 - ALPHA) * sum
  ...

Stable version (accumulates into a local variable and writes the public value once at the end):

GraphLab_pagerank(scope) {
  sum = 0
  forall (nbr in scope.in_neighbors)
    sum = sum + nbr.value / nbr.num_out_edges
  sum = ALPHA + (1 - ALPHA) * sum
  scope.center_value = sum
  ...

This was actually encountered in user code.

32 GraphLab Ensures Sequential Consistency: for each parallel execution, there exists a sequential execution of update functions which produces the same result. [Figure: a parallel execution on CPU 1 and CPU 2 and the equivalent sequential execution on a single CPU over time.]

33 Consistency Rules: guaranteed sequential consistency for all update functions.

34 Full Consistency

35 Obtaining More Parallelism

36 Edge Consistency. [Figure: CPU 1 and CPU 2 update different vertices; reads on the shared neighbor remain safe.]

37 The GraphLab Framework: Graph-Based Data Representation, Update Functions (User Computation), Scheduler, Consistency Model.

38 Carnegie Mellon University What algorithms are implemented in GraphLab?

39 Bayesian Tensor Factorization, Gibbs Sampling, Dynamic Block Gibbs Sampling, Splash Sampler, Matrix Factorization, Alternating Least Squares, Lasso, SVM, Belief Propagation, PageRank, CoEM, K-Means, SVD, LDA, Linear Solvers, …many others…

40 GraphLab Libraries. Matrix factorization: SVD, PMF, BPTF, ALS, NMF, Sparse ALS, Weighted ALS, SVD++, time-SVD++, SGD. Linear solvers: Jacobi, GaBP, Shotgun Lasso, sparse logistic regression, CG. Clustering: K-means, Fuzzy K-means, LDA, K-core decomposition. Inference: Discrete BP, NBP, Kernel BP.

41 Efficient Multicore Collaborative Filtering. LeBuSiShu team – 5th place in Track 1, ACM KDD Cup Workshop 2011. Yao Wu, Qiang Yan, Danny Bickson, Yucheng Low, Qing Yang. Institute of Automation, Chinese Academy of Sciences; Machine Learning Dept., Carnegie Mellon University.

42

43 ACM KDD Cup 2011. Task: predict music ratings. Two main challenges: data magnitude (260M ratings) and the taxonomy of the data.

44 Data taxonomy

45 Our approach: use an ensemble method and a custom SGD algorithm for handling the taxonomy.

46 Ensemble method: the individual solutions are merged using linear regression.
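A hedged sketch of this style of blending (illustrative only; the team's actual blending setup is not shown here): fit linear weights on held-out validation ratings by solving the least-squares normal equations.

#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

// Solve A x = b for a small dense system by Gaussian elimination with partial pivoting.
std::vector<double> solve(std::vector<std::vector<double>> A, std::vector<double> b) {
  const int n = A.size();
  for (int c = 0; c < n; ++c) {
    int p = c;
    for (int r = c + 1; r < n; ++r)
      if (std::fabs(A[r][c]) > std::fabs(A[p][c])) p = r;
    std::swap(A[c], A[p]); std::swap(b[c], b[p]);
    for (int r = c + 1; r < n; ++r) {
      double f = A[r][c] / A[c][c];
      for (int k = c; k < n; ++k) A[r][k] -= f * A[c][k];
      b[r] -= f * b[c];
    }
  }
  std::vector<double> x(n);
  for (int c = n - 1; c >= 0; --c) {
    double s = b[c];
    for (int k = c + 1; k < n; ++k) s -= A[c][k] * x[k];
    x[c] = s / A[c][c];
  }
  return x;
}

// Blend K base predictors: find weights w minimizing ||P w - y||^2 on validation ratings.
std::vector<double> blend(const std::vector<std::vector<double>>& P,  // P[i][k]: model k's prediction on rating i
                          const std::vector<double>& y) {             // y[i]: true validation rating
  const int K = P[0].size();
  std::vector<std::vector<double>> A(K, std::vector<double>(K, 0.0));
  std::vector<double> b(K, 0.0);
  for (size_t i = 0; i < P.size(); ++i)
    for (int a = 0; a < K; ++a) {
      b[a] += P[i][a] * y[i];
      for (int c = 0; c < K; ++c) A[a][c] += P[i][a] * P[i][c];
    }
  return solve(A, b);   // normal equations: (P^T P) w = P^T y
}

int main() {
  // Two base predictors evaluated on three validation ratings (toy numbers).
  std::vector<std::vector<double>> P = {{3.1, 2.8}, {4.2, 4.6}, {1.9, 2.3}};
  std::vector<double> y = {3.0, 4.5, 2.0};
  std::vector<double> w = blend(P, y);
  std::printf("blend weights: %.3f %.3f\n", w[0], w[1]);
}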

47 Performance results Blended Validation RMSE: 19.90

48 Classical Matrix Factorization. [Figure: sparse Users × Items rating matrix approximated by the product of a Users × d and a d × Items factor matrix.]
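A hedged sketch of classical matrix factorization trained with plain SGD (the textbook variant, not the team's custom taxonomy-aware SGD; all names and constants are illustrative):

#include <cstdlib>
#include <vector>

struct Rating { int user, item; double value; };

void sgd_mf(const std::vector<Rating>& ratings, int num_users, int num_items,
            int d = 20, double eta = 0.01, double lambda = 0.05, int epochs = 10) {
  // P: num_users x d user factors, Q: num_items x d item factors, small random init.
  std::vector<std::vector<double>> P(num_users, std::vector<double>(d)),
                                   Q(num_items, std::vector<double>(d));
  for (auto& row : P) for (double& v : row) v = 0.1 * std::rand() / RAND_MAX;
  for (auto& row : Q) for (double& v : row) v = 0.1 * std::rand() / RAND_MAX;

  for (int epoch = 0; epoch < epochs; ++epoch)
    for (const Rating& r : ratings) {
      double pred = 0;                                 // predicted rating: dot(P[u], Q[i])
      for (int k = 0; k < d; ++k) pred += P[r.user][k] * Q[r.item][k];
      double err = r.value - pred;                     // prediction error on this rating
      for (int k = 0; k < d; ++k) {                    // gradient step on both factors
        double pu = P[r.user][k], qi = Q[r.item][k];
        P[r.user][k] += eta * (err * qi - lambda * pu);
        Q[r.item][k] += eta * (err * pu - lambda * qi);
      }
    }
}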

49 MFITR. [Figure: sparse Users × Items matrix factorized with dimension d; the effective feature vector of an item combines item-specific features with features of its album and artist.]

50 Intuitively, features of an artist and features of his/her album should be similar. How do we express this? Penalty terms ensure that Artist/Album/Track features are close; the strength of each penalty depends on the normalized rating similarity (see the neighborhood model).
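One hedged way to write such penalties (an illustrative form only; the exact MFITR objective, its weights, and its similarity definition are not reproduced here) is to add, for a track t with album a(t) and artist r(t), the terms

  λ · s(t, a(t)) · ||q_t − q_{a(t)}||²  +  λ · s(a(t), r(t)) · ||q_{a(t)} − q_{r(t)}||²

to the matrix-factorization loss, where the q's are the feature vectors of the track, album, and artist, s(·,·) is the normalized rating similarity from the neighborhood model, and λ controls the overall penalty strength.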

51 Fine Tuning Challenge. The dataset has around 260M observed ratings; 12 different algorithms with a total of 53 tunable parameters. How do we train and cross-validate all these parameters? USE GRAPHLAB!

52 16 Cores Runtime

53 Speedup plots

54

55 Carnegie Mellon University Who is using GraphLab?

56 Universities using GraphLab

57 Companies trying out GraphLab. Startups using GraphLab. Unique downloads tracked (possibly many more from direct repository checkouts).

58 User community

59 Performance results

60 GraphLab vs. Pregel (BSP). [Figure: multicore PageRank (25M vertices, 355M edges) — GraphLab vs. Pregel (implemented via GraphLab); 51% of the vertices are updated only once.]

61 CoEM (Rosie Jones, 2005): Named Entity Recognition task — is "Dog" an animal? Is "Catalina" a place? [Figure: bipartite graph of noun phrases ("the dog", "Australia", "Catalina Island") and contexts ("ran quickly", "travelled to", "is pleasant").] Vertices: 2 million; edges: 200 million. Hadoop: 95 cores, 7.5 hrs.

62 CoEM (Rosie Jones, 2005). Hadoop: 95 cores, 7.5 hrs. GraphLab: 16 cores, 30 min — 15x faster with 6x fewer CPUs! [Figure: GraphLab CoEM speedup vs. number of cores against the optimal line; higher is better.]

63 Carnegie Mellon GraphLab in the Cloud

64 GraphLab in the Cloud: CoEM (Rosie Jones, 2005). Hadoop: 95 cores, 7.5 hrs. GraphLab: 16 cores, 30 min. GraphLab in the Cloud: 32 EC2 machines, 80 secs — 0.3% of the Hadoop time. [Figure: speedup curves for the small and large problem sizes against the optimal line.]

65 Cost-Time Tradeoff (video co-segmentation results): more machines cost more but finish faster; a few machines help a lot, then returns diminish.

66 Netflix Collaborative Filtering. Alternating Least Squares matrix factorization; model: 0.5 million nodes, 99 million edges (Users × Movies with latent dimension D). [Figure: Hadoop, MPI, and GraphLab runtimes against the ideal, for D = 20 and D = 100.]

67 Multicore Abstraction Comparison: Netflix matrix factorization. [Figure: dynamic computation gives faster convergence.]

68 The Cost of Hadoop

69 Carnegie Mellon University Fault Tolerance

70 Fault-Tolerance. Larger problems mean an increased chance of machine failure. GraphLab2 introduces two fault-tolerance (checkpointing) mechanisms: synchronous snapshots and Chandy-Lamport asynchronous snapshots.

71 Synchronous Snapshots. [Figure: execution alternates over time between running GraphLab and a barrier + snapshot phase.]

72 Curse of the Slow Machine. [Figure: progress over time with a synchronous snapshot vs. no snapshot.]

73 Curse of the Slow Machine. [Figure: a slow machine delays the barrier + snapshot, stalling the other machines before GraphLab can resume.]

74 Curse of the Slow Machine. [Figure: progress over time with no snapshot, synchronous snapshot, and delayed synchronous snapshot.]

75 Asynchronous Snapshots. The Chandy-Lamport snapshot algorithm is implementable as a GraphLab update function! (Requires edge consistency.)

struct chandy_lamport {
  void operator()(icontext_type& context) {
    save(context.vertex_data());
    foreach (edge_type edge, context.in_edges()) {
      if (edge.source() was not marked as saved) {
        save(context.edge_data(edge));
        context.schedule(edge.source(), chandy_lamport());
      }
    }
    ... // Repeat for context.out_edges
    Mark context.vertex() as saved;
  }
};

76 Snapshot Performance. [Figure: progress over time with no snapshot, synchronous snapshot, and asynchronous snapshot.]

77 Snapshot with 15s Fault Injection. [Figure: 1 out of 16 machines is halted for 15s; progress with no snapshot, asynchronous snapshot, and synchronous snapshot.]

78 New challenges

79 Natural Graphs are Power Law. Top 1% of vertices is adjacent to 53% of the edges! [Figure: degree distribution of the Yahoo! Web Graph (1.4B vertices, 6.7B edges).]

80 Problem: High Degree Vertices. High-degree vertices limit parallelism: they touch a large amount of state, require heavy locking, and are processed sequentially.

81 High Communication in Distributed Updates. [Figure: updating a vertex Y whose neighbors span Machine 1 and Machine 2; data from the neighbors is transmitted separately across the network.]

82 High Degree Vertices are Common. [Figure: Netflix users × movies (popular movies), social networks (popular people, e.g., Obama), and the LDA graphical model over docs and words (common words, shared hyperparameters).]

83 Two Core Changes to the Abstraction. Factorized update functors: monolithic updates are decomposed into Gather, Apply, and Scatter phases. Delta update functors: monolithic updates become composable update messages, (f1 ∘ f2).

84 PageRank in GraphLab (monolithic update, annotated with the phases it combines)

struct pagerank : public iupdate_functor {
  void operator()(icontext_type& context) {
    vertex_data& vdata = context.vertex_data();
    double sum = 0;
    // Parallel sum gather
    foreach (edge_type edge, context.in_edges())
      sum += 1 / context.num_out_edges(edge.source()) *
             context.vertex_data(edge.source()).rank;
    // Atomic single-vertex apply
    double old_rank = vdata.rank;
    vdata.rank = RESET_PROB + (1 - RESET_PROB) * sum;
    double residual = abs(vdata.rank - old_rank) / context.num_out_edges();
    // Parallel scatter [reschedule]
    if (residual > EPSILON)
      context.reschedule_out_neighbors(pagerank());
  }
};

85 Decomposable Update Functors. Locks are acquired only for the region within a scope (relaxed consistency). User-defined Gather: accumulated over in-edges in parallel (Δ = Δ1 + Δ2 + Δ3, a parallel sum). User-defined Apply: apply the accumulated value Δ to the center vertex. User-defined Scatter: update adjacent edges and vertices.

86 Factorized PageRank

double gather(scope, edge) {
  return edge.source().value().rank / scope.num_out_edge(edge.source())
}
double merge(acc1, acc2) {
  return acc1 + acc2
}
void apply(scope, accum) {
  old_value = scope.center_value().rank
  scope.center_value().rank = ALPHA + (1 - ALPHA) * accum
  scope.center_value().residual = abs(scope.center_value().rank - old_value)
}
void scatter(scope, edge) {
  if (scope.center_vertex().residual > EPSILON)
    reschedule(edge.target())
}

87 Factorized Updates: Significant Decrease in Communication. Split gather and scatter across machines: [Figure: partial gathers F1 and F2 run on separate machines and are combined as (F1 ∘ F2); only a small amount of data is transmitted over the network.]

88 Factorized Consistency. Neighboring vertices may be updated simultaneously: [Figure: vertices A and B gather at the same time.]

89 Factorized Consistency Locking. A gather on an edge cannot occur during an apply: vertex B gathers on its other neighbors while A is performing its Apply.

90 Factorized PageRank (GraphLab2)

struct pagerank : public iupdate_functor {
  double accum = 0, residual = 0;
  void gather(icontext_type& context, const edge_type& edge) {
    accum += 1 / context.num_out_edges(edge.source()) *
             context.vertex_data(edge.source()).rank;
  }
  void merge(const pagerank& other) { accum += other.accum; }
  void apply(icontext_type& context) {
    vertex_data& vdata = context.vertex_data();
    double old_value = vdata.rank;
    vdata.rank = RESET_PROB + (1 - RESET_PROB) * accum;
    residual = fabs(vdata.rank - old_value) / context.num_out_edges();
  }
  void scatter(icontext_type& context, const edge_type& edge) {
    if (residual > EPSILON)
      context.schedule(edge.target(), pagerank());
  }
};

91 Decomposable Loopy Belief Propagation Gather: Accumulates product of in messages Apply: Updates central belief Scatter: Computes out messages and schedules adjacent vertices

92 Decomposable Alternating Least Squares (ALS). [Figure: bipartite Netflix graph of users (factors w_i) and movies (factors x_j), Users × Movies.] Update function — Gather: sum the per-edge terms; Apply: matrix inversion and multiply.
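A hedged sketch of what the decomposed gather and apply could compute for one user vertex (uses the Eigen library; the struct names and the AlsGather/als_apply helpers are illustrative assumptions, not GraphLab's ALS toolkit code):

#include <Eigen/Dense>

struct MovieEdge { Eigen::VectorXd x; double rating; };   // neighbor's factor + edge rating

struct AlsGather {   // Gather: sum per-edge terms x x^T and r x (commutative and associative)
  Eigen::MatrixXd XtX;
  Eigen::VectorXd Xty;
  explicit AlsGather(int d)
      : XtX(Eigen::MatrixXd::Zero(d, d)), Xty(Eigen::VectorXd::Zero(d)) {}
  void gather(const MovieEdge& e) { XtX += e.x * e.x.transpose(); Xty += e.rating * e.x; }
  void merge(const AlsGather& o) { XtX += o.XtX; Xty += o.Xty; }   // combine partial gathers
};

// Apply: the "matrix inversion & multiply" step --
// solve (X^T X + lambda I) w = X^T y for the new user factor.
Eigen::VectorXd als_apply(const AlsGather& acc, double lambda) {
  int d = acc.Xty.size();
  return (acc.XtX + lambda * Eigen::MatrixXd::Identity(d, d)).ldlt().solve(acc.Xty);
}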

93 Decomposable Functors. Fits many algorithms: Loopy Belief Propagation, Label Propagation, PageRank… Addresses the earlier concerns: large state → distributed gather and scatter; heavy locking → fine-grained locking; sequential processing → parallel gather and scatter.

94 Comparison of Abstractions. [Figure: multicore PageRank (25M vertices, 355M edges) — GraphLab1 vs. factorized updates.]

95 Need for Vertex-Level Asynchrony. Exploit the commutative-associative sum: [Figure: a full gather at vertex Y is costly when only a single neighbor has changed!]

96 Commutative-Associative Vertex-Level Asynchrony: exploit the commutative-associative sum. [Figure: the neighbor values are summed into the center vertex Y.]

97 Commutative-Associative Vertex-Level Asynchrony: exploit the commutative-associative sum. [Figure: a changed neighbor contributes only its change, + Δ, to the sum at Y.]

98 Delta Updates: Vertex-Level Asynchrony. Exploit the commutative-associative sum. [Figure: Y keeps an old (cached) sum and an incoming Δ is added to it.]

99 Delta Updates: Vertex-Level Asynchrony. Exploit the commutative-associative sum. [Figure: multiple incoming deltas are added to the old (cached) sum at Y.]

100 Delta Update

void update(scope, delta) {
  scope.center_value() = scope.center_value() + delta
  if (abs(delta) > EPSILON) {
    out_delta = delta * (1 - ALPHA) / scope.num_out_edges()
    reschedule_out_neighbors(out_delta)
  }
}
double merge(delta1, delta2) { return delta1 + delta2 }

The program starts with: schedule_all(ALPHA)

101 Scheduling Composes Updates. Calling reschedule_out_neighbors forces update function composition: if pagerank(7) is already pending on a vertex and reschedule_out_neighbors(pagerank(3)) schedules it again, the two merge into a pending pagerank(10); if nothing is pending, pagerank(3) is simply enqueued.
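A hedged sketch of how a scheduler could compose functors this way (toy_scheduler and its members are hypothetical names, not GraphLab internals):

#include <cstdio>
#include <unordered_map>

// Illustrative composable delta functor: merging adds the deltas.
struct pagerank { double delta = 0; void merge(const pagerank& o) { delta += o.delta; } };

struct toy_scheduler {
  std::unordered_map<int, pagerank> pending;          // vertex id -> pending functor
  void schedule(int vertex, const pagerank& f) {
    auto it = pending.find(vertex);
    if (it == pending.end()) pending[vertex] = f;     // nothing pending: enqueue as-is
    else it->second.merge(f);                         // already pending: compose by merging
  }
};

int main() {
  toy_scheduler s;
  pagerank a; a.delta = 7;
  pagerank b; b.delta = 3;
  s.schedule(42, a);   // pending: pagerank(7)
  s.schedule(42, b);   // composed: pending becomes pagerank(10)
  std::printf("pending delta on vertex 42 = %g\n", s.pending[42].delta);
}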

102 Multicore Abstraction Comparison Multicore PageRank (25M Vertices, 355M Edges)

103 Distributed Abstraction Comparison. [Figure: distributed PageRank (25M vertices, 355M edges) — GraphLab1 vs. GraphLab2 with delta updates.]

104 PageRank on the Altavista Webgraph (1.4B vertices, 6.7B edges). Hadoop: 9000 s on 800 cores. Prototype GraphLab2: 431 s on 512 cores (known inefficiencies; a 2x gain is possible).

105 Summary of GraphLab2. Decomposed update functions (Gather, Apply, Scatter): expose parallelism in high-degree vertices. Delta update functions: expose asynchrony in high-degree vertices.

106 Lessons Learned. Machine learning: asynchronous execution is often much faster than synchronous; dynamic computation is often faster, though optimal thresholds can be difficult to define (science to do!); consistency can improve performance and is sometimes required for convergence, though there are cases where relaxed consistency is sufficient. Systems: distributed asynchronous systems are harder to build, but no distributed barriers means better scalability and performance; scaling up by an order of magnitude requires rethinking design assumptions (e.g., the distributed graph representation); high-degree vertices and natural graphs can limit parallelism and require further assumptions on update functions.

107 Summary. An abstraction tailored to machine learning: targets graph-parallel algorithms; naturally expresses data/computational dependencies and dynamic, iterative computation; simplifies parallel algorithm design; automatically ensures data consistency; achieves state-of-the-art parallel performance on a variety of problems.

108 Parallel GraphLab 1.1 (multicore) is available today; GraphLab2 (in the Cloud) coming soon… Documentation… Code… Tutorials… Carnegie Mellon.

