1 The Next Generation of the GraphLab Abstraction. Joseph Gonzalez, joint work with Yucheng Low, Aapo Kyrola, Danny Bickson, Carlos Guestrin, Guy Blelloch, Joe Hellerstein, David O'Hallaron, and Alex Smola.

2 How will we design and implement parallel learning systems?

3 A popular answer: Map-Reduce / Hadoop. Build learning algorithms on top of high-level parallel abstractions.

4 Map-Reduce for Data-Parallel ML. Map-Reduce is excellent for large data-parallel tasks: cross validation, feature extraction, computing sufficient statistics. But many methods are graph-parallel: belief propagation, label propagation, kernel methods, deep belief networks, neural networks, tensor factorization, PageRank, Lasso. Is there more to machine learning than data-parallel computation?

5 Concrete Example: Label Propagation

6 Label Propagation Algorithm. Social arithmetic: my interests are a weighted combination of my own profile and my friends' interests. Example: 50% what I list on my profile (50% cameras, 50% biking), 40% what Sue Ann likes (80% cameras, 20% biking), 10% what Carlos likes (30% cameras, 70% biking), so I like 60% cameras, 40% biking. Recurrence algorithm: iterate until convergence. Parallelism: compute all Likes[i] in parallel.
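To make the arithmetic concrete, here is a minimal, self-contained C++ sketch of the weighted combination above; the struct and function names are illustrative, not part of GraphLab.

#include <cstddef>
#include <iostream>
#include <vector>

// Interest distribution over two topics: {cameras, biking}.
struct Likes { double cameras, biking; };

// likes[i] <- sum_j w[j] * likes[j], where j ranges over i's neighborhood
// (including i itself with the "what I list on my profile" weight).
Likes combine(const std::vector<double>& w, const std::vector<Likes>& nbr) {
    Likes out{0.0, 0.0};
    for (std::size_t j = 0; j < nbr.size(); ++j) {
        out.cameras += w[j] * nbr[j].cameras;
        out.biking  += w[j] * nbr[j].biking;
    }
    return out;
}

int main() {
    // Weights from the slide: 50% my profile, 40% Sue Ann, 10% Carlos.
    std::vector<double> w  = {0.5, 0.4, 0.1};
    std::vector<Likes> nbr = {{0.5, 0.5},   // my profile
                              {0.8, 0.2},   // Sue Ann
                              {0.3, 0.7}};  // Carlos
    Likes me = combine(w, nbr);
    // Prints 0.6 cameras, 0.4 biking, matching the slide.
    std::cout << me.cameras << " cameras, " << me.biking << " biking\n";
}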

7 Properties of Graph-Parallel Algorithms: a dependency graph, factored computation, and iterative computation (what I like depends on what my friends like).

8 Map-Reduce for Data-Parallel ML. Data-parallel tasks (cross validation, feature extraction, computing sufficient statistics) fit Map-Reduce well. Can Map-Reduce also handle the graph-parallel column: belief propagation, label propagation, kernel methods, deep belief networks, neural networks, tensor factorization, PageRank, Lasso?

9 Why not use Map-Reduce for Graph Parallel Algorithms?

10 Data Dependencies. Map-Reduce does not efficiently express dependent data: it assumes independent data rows, so the user must code substantial data transformations and pay for costly data replication.

11 Iterative Algorithms. Map-Reduce does not efficiently express iterative algorithms: each iteration runs as a separate pass over the data and ends at a barrier, so a single slow processor stalls the entire iteration.

12 MapAbuse: Iterative MapReduce. Only a subset of the data needs computation in each iteration, yet every pass processes all of it and waits at the barrier.

13 MapAbuse: Iterative MapReduce. The system is not optimized for iteration: each pass pays a startup penalty and a disk penalty.

14 Map-Reduce for Data-Parallel ML. Data-parallel tasks (cross validation, feature extraction, computing sufficient statistics) fit Map-Reduce. For the graph-parallel column (belief propagation, SVMs, kernel methods, deep belief networks, neural networks, tensor factorization, PageRank, Lasso), what about Pregel (Giraph)?

15 Pregel (Giraph). Bulk Synchronous Parallel model: compute, communicate, barrier.

16 Problem with Bulk Synchronous. Example algorithm: if a neighbor is red, turn red. Bulk synchronous computation evaluates the condition on every vertex in every phase: 4 phases × 9 computations per phase = 36 computations. Asynchronous (wave-front) computation evaluates the condition only when a neighbor changes: 4 phases × 2 computations per phase = 8 computations.

17 The Need for a New Abstraction. Map-Reduce is not well suited for graph-parallelism: data-parallel tasks (cross validation, feature extraction, computing sufficient statistics) belong to Map-Reduce, while graph-parallel tasks (belief propagation, SVMs, kernel methods, deep belief networks, neural networks, tensor factorization, PageRank, Lasso) call for something beyond Pregel (Giraph).

18 What is GraphLab?

19 The GraphLab Framework: a graph-based data representation, update functions (user computation), a scheduler, and a consistency model.

20 Data Graph. A graph with arbitrary data (C++ objects) associated with each vertex and edge. Example (social network): vertex data holds the user profile text and current interest estimates; edge data holds similarity weights.
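As a rough illustration, the vertex and edge data for this example could be declared as plain C++ structs like the following; the field names are assumptions for illustration, and the construction calls in the trailing comment mirror the add_vertex/add_edge API shown later on slide 80 rather than exact GraphLab signatures.

#include <string>
#include <vector>

// Illustrative vertex and edge data for the social-network example;
// GraphLab lets you attach arbitrary C++ objects like these to the graph.
struct vertex_data {
    std::string profile_text;          // user profile text
    std::vector<double> likes;         // current interest estimates
};

struct edge_data {
    double similarity;                 // similarity weight W_ij
};

// Hypothetical construction in the style of slide 80's API; graph_type and
// vertex_id_type stand in for the actual GraphLab graph templates.
//   graph_type graph;
//   vertex_id_type u = graph.add_vertex(vertex_data{"profile text", {0.5, 0.5}});
//   vertex_id_type v = graph.add_vertex(vertex_data{"profile text", {0.8, 0.2}});
//   graph.add_edge(u, v, edge_data{0.4});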

21 Update Functions. An update function is a user-defined program which, when applied to a vertex, transforms the data in the scope of that vertex. For label propagation:
label_prop(i, scope) {
  // Get neighborhood data
  (Likes[i], W_ij, Likes[j]) <- scope;
  // Update the vertex data
  Likes[i] <- sum over neighbors j of W_ij * Likes[j];
  // Reschedule neighbors if needed
  if Likes[i] changes then reschedule_neighbors_of(i);
}
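A self-contained sketch of what such an update function could look like over a toy in-memory graph; the Graph struct and the eps threshold are stand-ins for GraphLab's scope API, not the real interface.

#include <cmath>
#include <cstddef>
#include <vector>

// Minimal stand-ins for the data-graph pieces the update function touches.
// A self-loop in nbrs[i] can carry the "what I list on my profile" weight.
struct Graph {
    std::vector<std::vector<int>>    nbrs;   // adjacency lists
    std::vector<std::vector<double>> w;      // per-neighbor weights W_ij
    std::vector<std::vector<double>> likes;  // per-vertex interest estimates
};

// Label-propagation update: recompute Likes[i] from the scope (vertex i,
// its edges, its neighbors) and report whether it changed enough that the
// neighbors should be rescheduled.
bool label_prop_update(Graph& g, int i, double eps = 1e-3) {
    std::vector<double> fresh(g.likes[i].size(), 0.0);
    for (std::size_t k = 0; k < g.nbrs[i].size(); ++k) {
        int j = g.nbrs[i][k];
        for (std::size_t t = 0; t < fresh.size(); ++t)
            fresh[t] += g.w[i][k] * g.likes[j][t];
    }
    double delta = 0.0;
    for (std::size_t t = 0; t < fresh.size(); ++t)
        delta += std::fabs(fresh[t] - g.likes[i][t]);
    g.likes[i] = fresh;
    return delta > eps;   // caller reschedules neighbors of i if true
}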

22 The Scheduler. The scheduler determines the order in which vertices are updated: CPUs pull vertices from the scheduler, apply the update function, and rescheduled neighbors are placed back in the queue. The process repeats until the scheduler is empty.
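A toy sequential sketch of such a scheduler loop (a FIFO worklist with de-duplication); the real GraphLab schedulers are parallel and pluggable, so this is only meant to show the pull-update-reschedule cycle under stated assumptions.

#include <queue>
#include <set>
#include <vector>

// CPUs pull vertices, apply the update function, and updates may push
// neighbors back on. UpdateFn is anything callable as update(g, i) -> bool,
// e.g. a lambda wrapping label_prop_update from the previous sketch.
template <typename Graph, typename UpdateFn>
void run_fifo(Graph& g, UpdateFn update, std::vector<int> initial) {
    std::queue<int> q;
    std::set<int> queued;                    // de-duplicate pending tasks
    for (int v : initial) { q.push(v); queued.insert(v); }

    while (!q.empty()) {                     // repeat until the scheduler is empty
        int i = q.front(); q.pop(); queued.erase(i);
        bool changed = update(g, i);         // apply the update function
        if (!changed) continue;
        for (int j : g.nbrs[i])              // reschedule neighbors if needed
            if (queued.insert(j).second) q.push(j);
    }
}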

23 The GraphLab Framework (recap): a graph-based data representation, update functions (user computation), a scheduler, and a consistency model.

24 Ensuring Race-Free Code. How much can computation overlap?

25 GraphLab Ensures Sequential Consistency. For each parallel execution, there exists a sequential execution of update functions which produces the same result.

26 Consistency Rules. Sequential consistency is guaranteed for all update functions.

27 Full Consistency

28 Obtaining More Parallelism

29 Edge Consistency. Two CPUs can update nearby vertices concurrently, since overlapping vertex data is only read (a safe read), not written.

30 Consistency Through R/W Locks. Read/write locks implement the consistency models: full consistency write-locks the center vertex and its entire neighborhood; edge consistency write-locks the center vertex and read-locks adjacent vertices. Locks are acquired in a canonical ordering to prevent deadlock.
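A hedged sketch of how canonical lock ordering might be implemented with reader/writer locks; std::shared_mutex stands in for GraphLab's internal lock type, and the class name is illustrative.

#include <algorithm>
#include <memory>
#include <shared_mutex>
#include <vector>

// Edge consistency: write-lock the center vertex, read-lock its neighbors.
// Full consistency would write-lock the neighbors too. Locks are always
// taken in sorted vertex-id order (canonical ordering) so two overlapping
// scopes cannot deadlock. Assumes nbrs contains no duplicates or self-loops.
class LockTable {
    std::unique_ptr<std::shared_mutex[]> locks_;
public:
    explicit LockTable(std::size_t n) : locks_(new std::shared_mutex[n]) {}

    void lock_edge_scope(int center, std::vector<int> nbrs) {
        nbrs.push_back(center);
        std::sort(nbrs.begin(), nbrs.end());          // canonical ordering
        for (int v : nbrs) {
            if (v == center) locks_[v].lock();         // write lock
            else             locks_[v].lock_shared();  // read lock
        }
    }
    void unlock_edge_scope(int center, const std::vector<int>& nbrs) {
        for (int v : nbrs) locks_[v].unlock_shared();
        locks_[center].unlock();
    }
};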

31 GraphLab for Natural Graphs: the Achilles' heel.

32 Problem: High-Degree Vertices. Graphs with high-degree vertices are common: power-law graphs (social networks), which affect algorithms like label propagation; probabilistic graphical models with hyperparameters that couple large sets of data; and connectivity structure induced by natural phenomena. High-degree vertices kill parallelism: they pull a large amount of state, require heavy locking, and are processed sequentially.

33 Proposed Solutions. Decomposable update functors: expose greater parallelism by further factoring update functions. Abelian group caching (concurrent revisions): allows controllable races through diff operations. Stochastic scopes: reduce degree through sampling.

34 Associative Commutative Update Functors. Update functors are associative, commutative functions (e.g., 1 + (2 + 3) = (1 + 2) + 3); this subsumes message passing, the traveler model, and Pregel (Giraph), and functor composition is equivalent to a combiner in Pregel. For label propagation:
LabelPropUF(i, scope, Δ) {
  // Get neighborhood data
  (OldLikes[i], Likes[i], W_ij) <- scope;
  // Update the vertex data
  Likes[i] <- Likes[i] + Δ;
  // Reschedule neighbors if needed
  if |Likes[i] - OldLikes[i]| > ε then
    for all neighbors j: reschedule(j, (Likes[i] - OldLikes[i]) * W_ij);
    OldLikes[i] <- Likes[i];
}
Update functor composition is achieved by defining an operator+ for update functors, and composed functors may be computed in parallel.
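A minimal C++ sketch of such a functor, assuming its state is just the accumulated delta and that operator+= plays the role of the user-defined (+); the names are illustrative rather than GraphLab's actual functor interface.

#include <cstddef>
#include <vector>

// An associative, commutative update functor for label propagation: the
// functor carries a delta, and merging two pending functors for the same
// vertex simply adds their deltas (the combiner-like composition above).
struct label_prop_functor {
    std::vector<double> delta;               // accumulated change to Likes[i]

    label_prop_functor& operator+=(const label_prop_functor& other) {
        if (delta.size() < other.delta.size())
            delta.resize(other.delta.size(), 0.0);
        for (std::size_t t = 0; t < other.delta.size(); ++t)
            delta[t] += other.delta[t];      // associative and commutative
        return *this;
    }
};

// Applying the functor at vertex i (locking and scheduling omitted):
//   Likes[i] += functor.delta;
//   if ||Likes[i] - OldLikes[i]|| > eps:
//       for each neighbor j:
//           reschedule(j, label_prop_functor{ (Likes[i] - OldLikes[i]) * W_ij });
//       OldLikes[i] = Likes[i];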

35 Advantages of Update Functors. Eliminates the need to read neighbor values and reduces locking overhead. Enables task de-duplication and hinting: a user-defined (+) can choose to eliminate or compose pending functors. Dynamically controls execution parameters: priority(), consistency_level(), is_decomposable(), and so on. Needed to enable further extensions (decomposable update functors, etc.). Simplifies schedulers: many update functions become a single update functor.

36 Disadvantages of Update Functors. Functors are typically constructed sequentially and require access to edge weights; these problems exist in Pregel/Giraph as well. The change required substantial engineering effort, affecting about 2/3 of the GraphLab library.

37 Decomposable Update Functors: breaking computation over the edges of the graph.

38 Decomposable Update Functors. Decompose update functions into three user-defined phases: Gather, applied to each edge in the scope and combined with a parallel sum (Δ1 + Δ2 -> Δ3); Apply, which applies the accumulated value Δ to the center vertex; and Scatter, which updates adjacent edges and vertices. Locks are acquired only for the region within a phase's scope, giving relaxed consistency.

39 Implementing Label Propagation with Factorized Update Functions. Implemented using the update functor's state as the accumulator; partial gathers are merged with the same (+) operator (Δ1 + Δ2 -> Δ3):
Gather(i, j, scope) {
  // Get neighborhood data
  (Likes[i], W_ij, Likes[j]) <- scope;
  // Emit accumulator
  emit W_ij * Likes[j] as Δ;
}
Apply(i, scope, Δ) {
  // Update the vertex
  Likes[i] <- Δ;
}
Scatter(i, j, scope) {
  // Get neighborhood data
  (Likes[i], W_ij, Likes[j]) <- scope;
  // Reschedule if changed
  if Likes[i] changed then reschedule(j);
}
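A self-contained sketch of the three phases for label propagation that mirrors the pseudocode above; the free functions and the gather_result type are illustrative stand-ins for the decomposable-functor interface, not GraphLab's actual class names.

#include <cstddef>
#include <vector>

struct gather_result {
    std::vector<double> sum;                             // accumulated W_ij * Likes[j]
    gather_result& operator+=(const gather_result& o) {  // parallel-sum merge
        if (sum.size() < o.sum.size()) sum.resize(o.sum.size(), 0.0);
        for (std::size_t t = 0; t < o.sum.size(); ++t) sum[t] += o.sum[t];
        return *this;
    }
};

// Gather: run independently (and in parallel) on each edge (i, j).
gather_result gather(double w_ij, const std::vector<double>& likes_j) {
    gather_result r;
    r.sum.resize(likes_j.size());
    for (std::size_t t = 0; t < likes_j.size(); ++t) r.sum[t] = w_ij * likes_j[t];
    return r;
}

// Apply: write the accumulated value to the center vertex.
void apply(std::vector<double>& likes_i, const gather_result& acc) {
    likes_i = acc.sum;
}

// Scatter: per edge, decide whether the neighbor needs rescheduling.
bool scatter_needs_reschedule(const std::vector<double>& old_likes_i,
                              const std::vector<double>& likes_i,
                              double eps = 1e-3) {
    double d = 0.0;
    for (std::size_t t = 0; t < likes_i.size(); ++t)
        d += (likes_i[t] - old_likes_i[t]) * (likes_i[t] - old_likes_i[t]);
    return d > eps * eps;
}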

40 Decomposable Functors. Fits many algorithms: loopy belief propagation, label propagation, PageRank, and more. Addresses the earlier concerns: large state is handled by distributed gather and scatter, heavy locking by fine-grained locking, and sequential processing by parallel gather and scatter. Remaining problem: high-degree vertices will still get locked frequently and will need to be up to date.

41 Abelian Group Caching: enabling eventually consistent data races.

42 Abelian Group Caching. Issue: all earlier methods maintain a sequentially consistent view of the data across all processors. Proposal: try to split the data instead of the computation. How can we split the graph without changing the update function?

43 Insight from the WSDM paper. Answer: allow eventually consistent data races. High-degree vertices admit slightly "stale" values, since changes in a few elements have a negligible effect, and their updates are typically a form of "sum" operation which has an "inverse" (examples: counts, averages, sufficient statistics; counter-example: max). Goal: lazily synchronize duplicate data, similar to a version control system; intermediate values are only partially consistent, but the final value at termination must be consistent.

44–60 Example. Every processor initially has a copy of the same central value, 10, held at a master. Each processor then makes a small local change: processor 1 adds 1 (11), processor 2 subtracts 3 (7), processor 3 adds 3 (13); the true value is 10 + 1 - 3 + 3 = 11. Processors 1 and 2 send their delta values (diffs, +1 and -3) to the master, which applies them and becomes 8, so the master is now consistent with the first two processors' changes. The master then decides to refresh the other processors with 8; processor 3 reapplies its still-pending +3 on top of the refreshed value, giving 11. When processor 3 later sends its diff (+3) to the master, the master also becomes 11 and is globally consistent with the true value.
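A compact sketch of the diff protocol in the walkthrough above, assuming a single numeric value whose Abelian group operations are ordinary + and -; the Replica type and its synchronize method are illustrative, and the sketch sends each diff and refreshes in one step rather than separately as in the slides, but it converges to the same final value.

#include <iostream>

struct Replica {
    double current, old;                  // working copy and last-synced copy
    explicit Replica(double v) : current(v), old(v) {}

    // Send the diff (current - old) to the master, then refresh from it.
    void synchronize(double& master) {
        master += current - old;          // apply this replica's diff
        current = old = master;           // adopt the master's new value
    }
};

int main() {
    double master = 10;                   // every processor starts from 10
    Replica p1(10), p2(10), p3(10);
    p1.current += 1;                      // processor 1: 11
    p2.current -= 3;                      // processor 2: 7
    p3.current += 3;                      // processor 3: 13

    p1.synchronize(master);               // master: 11
    p2.synchronize(master);               // master: 8
    p3.synchronize(master);               // master: 11
    std::cout << master << "\n";          // prints 11, the true value 10+1-3+3
}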

61 Abelian Group Caching. The data must have a commutative (+) and an inverse (-) operation. In GraphLab we have encountered many applications with a bipartite form: bounded, low-degree data vertices connected to high-degree (power-law) parameter vertices.

62 Examples with this bipartite data/parameter structure include clustering, topic models, Lasso, and others.

63 Caching Replaces Locks. Instead of locking, cache entries are created: each processor maintains an LRU vertex-data cache, and locks are acquired in parallel and only on a cache miss. The user must define (+) and (-) operations for vertex data, or more simply vdata.apply_diff(new_vdata, old_vdata), and specifies the maximum allowable "staleness". Works with existing update functions/functors.

64 Hierarchical Caching. The caching strategy can be composed across systems of varying latency: a thread cache, a system cache, per-rack caches, and finally a distributed hash table of masters, with cache resolution proceeding up the hierarchy.

65 Hierarchical Caching. The current implementation uses two tiers: a thread cache and a system cache, resolving misses against the distributed hash table of masters.

66 Contention-Based Caching. Idea: only use the caching strategy when a lock is frequently contended: try_lock first and use the true data on success; on failure, lock and cache, and use the cached copy. This reduces the effective cache size. Tested on LDA and PageRank; works under LDA.
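A minimal sketch of the contention-based policy using std::mutex::try_lock; the CachedVertex layout and the idea of refreshing the cache on an uncontended read are illustrative assumptions, not the GraphLab implementation.

#include <atomic>
#include <mutex>

struct CachedVertex {
    std::mutex lock;
    double data = 0.0;                    // true (master) value
    std::atomic<double> cached{0.0};      // possibly stale local copy
};

double read_with_contention_cache(CachedVertex& v) {
    if (v.lock.try_lock()) {              // uncontended: use the true data
        double x = v.data;
        v.cached.store(x);                // keep the cache fresh as a side effect
        v.lock.unlock();
        return x;
    }
    return v.cached.load();               // contended: use the cached copy
}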

67 Created a New Library! An Abelian cached distributed hash table; it should be running on the grid soon.
int main(int argc, char** argv) {
  // Initialize the system using Hadoop-friendly TCP connections
  dc_init_param rpc_parameters;
  init_param_from_env(rpc_parameters);
  distributed_control dc(rpc_parameters);
  // Create an Abelian cached map
  delta_dht tbl(dc, 100);
  // Add entries
  tbl["hello"] += 1.0;
  tbl["world"] -= 3.0;
  tbl.synchronize("world");
  // Read values
  std::cout << tbl["hello"] << std::endl;
}

68 Stochastic Scopes: bounding degree through sampling.

69 Stochastic Scopes. Idea: can we "sample" the neighborhood of a vertex? Randomly sample a neighborhood of fixed size, constructing a sample scope and locking only the selected neighbors, as in label_prop(i, scope, p). Currently only uniform sampling is supported; weighted sampling will likely be needed, and a theory of stochastic scopes in learning algorithms still needs to be developed.
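A small sketch of constructing such a sampled scope with uniform sampling (C++17 std::sample); the function name and the fixed budget k are assumptions for illustration.

#include <algorithm>
#include <iterator>
#include <random>
#include <vector>

// Uniformly sample (without replacement) at most k neighbors of a vertex;
// only the sampled neighbors would then be locked and read by the update
// function, bounding the effective degree.
std::vector<int> sample_scope(const std::vector<int>& nbrs, std::size_t k,
                              std::mt19937& rng) {
    if (nbrs.size() <= k) return nbrs;    // small neighborhoods keep everyone
    std::vector<int> sampled;
    sampled.reserve(k);
    std::sample(nbrs.begin(), nbrs.end(), std::back_inserter(sampled), k, rng);
    return sampled;
}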

70 EARLY EXPERIMENT

71 Implemented LDA in GraphLab. Used collapsed Gibbs sampling for LDA as the test application. GraphLab formulation: a bipartite graph of document vertices and word vertices; each edge stores {#[w,d,t], #[w,d]}, document vertices store #[d,t], word vertices store #[w,t], and the topic counts #[t] are a global variable.

72–73 The update function resamples the per-token topic counts #[w,d,t] using the collapsed Gibbs conditional and then updates #[d,t], #[w,d,t], and #[w,t].
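For reference, a self-contained sketch of the resampling step using the standard collapsed Gibbs conditional p(t) ∝ (#[d,t] + α)(#[w,t] + β)/(#[t] + Vβ); the data layout, hyperparameters, and function name are illustrative, and the caller is assumed to decrement the token's counts before resampling and increment them again afterwards.

#include <cstddef>
#include <random>
#include <vector>

// Resample the topic of a single (word w, document d) token.
// n_dt = #[d,t], n_wt = #[w,t], n_t = #[t] (global), with the current
// token's assignment already removed from all three.
int resample_topic(const std::vector<int>& n_dt,
                   const std::vector<int>& n_wt,
                   const std::vector<int>& n_t,
                   double alpha, double beta, int vocab_size,
                   std::mt19937& rng) {
    const std::size_t T = n_t.size();
    std::vector<double> p(T);
    for (std::size_t t = 0; t < T; ++t)
        p[t] = (n_dt[t] + alpha) * (n_wt[t] + beta) / (n_t[t] + vocab_size * beta);
    std::discrete_distribution<int> dist(p.begin(), p.end());
    return dist(rng);   // new topic; caller increments #[d,t], #[w,t], #[t]
}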

74 GraphLab LDA Scaling Curves. The factorized version is close to exact parallel Gibbs sampling, even though it only uses "stale" topic counts #[t]; the cached system ran with an update lag of 2. We still need to evaluate the effect of lag on convergence.

75 Other Preliminary Observations. PageRank on the Y!Messenger friend network: 14x speedup (on 16 cores) using the new approaches versus 12x speedup (on 16 cores) using the original GraphLab; I suspect an inefficiency in functor composition is "improving" the scaling. LDA over the new DHT data structures appears to scale linearly on small 4-machine deployments and keeps the cache relatively fresh (2-3 update lag), but it needs more evaluation and system optimization.

76 Summary and Future Work. We have begun to address the challenge of natural graphs. After substantial engineering effort we have decomposable update functors, Abelian group caching (eventual consistency), and stochastic scopes. We plan to evaluate these on the following applications: LDA (both collapsed Gibbs and CVB0), probabilistic matrix factorization, loopy BP on Markov logic networks, and label propagation on social networks.

77 Check out GraphLab: http://graphlab.org. See also the NIPS Workshop on Large-Scale Learning: http://biglearn.org. Questions and feedback: jegonzal@cs.cmu.edu.

78 GraphLab LDA Scaling Curves

79 Problems with GraphLab

80 Problems with the Data Graph. How is the data graph constructed? Sequentially, in physical memory, by the user (graph.add_vertex(vertex_data) -> vertex_id; graph.add_edge(source_id, target_id, edge_data)), or in parallel using a complex binary file format (graph atoms: fragments of the graph). How is the data graph stored between runs? By the user in a distributed file system, with no notion of locality, no convenient tools to read the output of GraphLab, and no out-of-core storage, which limits the size of graphs.

81 Solution: Hadoop/HDFS DataGraph. Graph construction and storage using Hadoop: we developed a simple Avro graph file format and implemented a reference Avro graph constructor in Hadoop, which automatically sorts records for fast locking and simplifies computing edge-reversal maps; tested on a subset of the Twitter data set. Hadoop/HDFS manages launching and post-processing: Hadoop streaming assigns graph fragments, and the output of GraphLab can be processed in Hadoop. Problem: waiting on C++. The per-vertex record is ScopeRecord { ID vertexId; VDataRecord vdata; List NeighborsIds; List EdgeData; }.

82 Out-of-Core Storage. Problem: what if the graph doesn't fit in memory? Solution: disk-based caching. Only the design specification is complete; a collaborator is writing a memcached-backed file system. In the design, physical memory holds a local scope map (local vertex id -> file offset), local vertex locks, and an object cache; out-of-core storage holds scope records (vertex data, edge data, adjacency lists); and a DHT provides a distributed map from vertex id to owning instance, backed by remote storage and consulted when the object cache misses.


Download ppt "Joseph Gonzalez Joint work with Yucheng Low Aapo Kyrola Danny Bickson Carlos Guestrin Guy Blelloch Joe Hellerstein David O’Hallaron Alex Smola The Next."

Similar presentations


Ads by Google