
1 SGD ON HADOOP FOR BIG DATA & HUGE MODELS Alex Beutel Based on work done with Abhimanu Kumar, Vagelis Papalexakis, Partha Talukdar, Qirong Ho, Christos Faloutsos, and Eric Xing

2 Outline
1. When to use SGD for distributed learning
2. Optimization
   - Review of DSGD
   - SGD for Tensors
   - SGD for ML models: topic modeling, dictionary learning, MMSB
3. Hadoop
   1. General algorithm
   2. Setting up the MapReduce body
   3. Reducer communication
   4. Distributed normalization
   5. "Always-On SGD": how to deal with the straggler problem
4. Experiments

3 When distributed SGD is useful
- Collaborative Filtering: predict movie preferences
- Topic Modeling: what are the topics of webpages, tweets, or status updates?
- Dictionary Learning: remove noise or missing pixels from images
- Tensor Decomposition: find communities in temporal graphs
The scale that motivates distribution: 300 million photos uploaded to Facebook per day, 1 billion Facebook users, 400 million tweets per day.

4 Gradient Descent

5 Stochastic Gradient Descent (SGD)
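The update equations on the surrounding slides were rendered as images. For reference, a minimal numpy sketch of the SGD update for the matrix-factorization squared loss used in the examples that follow; the function and variable names are illustrative, not from the slides:

    import numpy as np

    def sgd_step(U, V, i, j, x_ij, step):
        # One SGD update for the squared loss (x_ij - u_i . v_j)^2 on a
        # single observed entry. Full gradient descent would sum this
        # gradient over all entries before updating; SGD updates after
        # every sampled point, so each step is cheap.
        err = x_ij - U[i] @ V[j]       # residual on this single entry
        u_old = U[i].copy()            # keep u_i fixed while updating v_j
        U[i] += step * err * V[j]
        V[j] += step * err * u_old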

6

7 SGD Background

8 DSGD for Matrices (Gemulla, 2011): factor the Users × Movies ratings matrix X as X ≈ UV, with U (Users × Genres) and V (Genres × Movies).

9 DSGD for Matrices (Gemulla, 2011): blocks of X that share no rows or columns touch disjoint rows of U and disjoint columns of V — their SGD updates are independent!

10 DSGD for Matrices (Gemulla, 2011) Independent Blocks

11 DSGD for Matrices (Gemulla, 2011): partition your data & model into d × d blocks; this yields d strata (here d = 3). Process the strata sequentially, and process the blocks within each stratum in parallel.
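A minimal sketch of this block schedule, assuming blocks are indexed 0..d-1; the shifted-diagonal pattern is one standard way to form the strata:

    def matrix_strata(d):
        # Stratum s contains the blocks {(i, (i + s) % d) : i = 0..d-1},
        # a (shifted) diagonal of the d x d blocking. The d blocks in a
        # stratum touch disjoint row-blocks of U and disjoint column-blocks
        # of V, so SGD can run on them in parallel without conflicts.
        for s in range(d):
            yield [(i, (i + s) % d) for i in range(d)]

    # d = 3 gives 3 strata:
    # [(0,0),(1,1),(2,2)], [(0,1),(1,2),(2,0)], [(0,2),(1,0),(2,1)]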

12 TENSORS

13 What is a tensor? Tensors are used for structured data with more than 2 dimensions; think of a 3-mode tensor as a 3D matrix. For example, (Subject, Verb, Object) triples such as "Derek Jeter plays baseball".
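Concretely, a 3-mode tensor is just a 3-dimensional array; a small illustrative numpy example (the sizes and index values are made up):

    import numpy as np

    # X[s, v, o] counts how often subject s relates to object o via verb v.
    n_subjects, n_verbs, n_objects = 100, 20, 100
    X = np.zeros((n_subjects, n_verbs, n_objects))
    X[7, 3, 42] += 1.0   # one observed triple, e.g. (Derek Jeter, plays, baseball)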

14 Tensor Decomposition: X ≈ (U, V, W)

15 [diagram: X ≈ (U, V, W)]

16 [diagram: X ≈ (U, V, W), with one block of X highlighted]

17 [diagram: X ≈ (U, V, W) — some pairs of blocks are independent, others are not]

18 Tensor Decomposition

19 For d = 3 blocks per stratum, we require d² = 9 strata
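A sketch of how the d² strata can be enumerated for a d × d × d blocking, extending the matrix schedule with a second offset; this shifted-diagonal choice is one valid assignment, not necessarily the exact one used in the talk:

    from itertools import product

    def tensor_strata(d):
        # Stratum (s1, s2) contains blocks {(i, (i+s1) % d, (i+s2) % d)}:
        # within it, the d blocks touch disjoint slabs of U, V and W, so
        # they are mutually independent. Two offsets => d * d = d**2 strata.
        for s1, s2 in product(range(d), range(d)):
            yield [(i, (i + s1) % d, (i + s2) % d) for i in range(d)]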

20 Tensor Decomposition

21 Coupled Matrix + Tensor Decomposition: a tensor X (Subject × Verb × Object) coupled with a matrix Y that shares the Subject mode (Subject × Document).

22 Coupled Matrix + Tensor Decomposition: X ≈ (U, V, W) and Y ≈ U·Aᵀ, sharing the factor U.
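A sketch of the coupled objective, assuming squared loss and a CP (rank-R) model with the subject factor U shared between the tensor and the matrix, which is what the picture shows:

    import numpy as np

    def coupled_loss(X, Y, U, V, W, A):
        # X[i,j,k] is modeled as sum_r U[i,r] * V[j,r] * W[k,r]  (CP model)
        # Y[i,d]   is modeled as sum_r U[i,r] * A[d,r]           (shared U)
        X_hat = np.einsum('ir,jr,kr->ijk', U, V, W)
        Y_hat = U @ A.T
        return np.sum((X - X_hat) ** 2) + np.sum((Y - Y_hat) ** 2)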

23

24 CONSTRAINTS & PROJECTIONS

25 Example: Topic Modeling — factor a Documents × Words matrix through a small set of Topics (Documents × Topics and Topics × Words factors).

26 Constraints: sometimes we want to restrict the parameters:
- non-negative
- sparse
- on the simplex (so vectors become probabilities)
- inside the unit ball

27 How to enforce them? Projections: after each SGD step, project the parameters back onto the constraint set. Example: non-negativity — clip negative entries to zero.

28 More projections: sparsity (soft thresholding), the simplex, the unit ball (the formulas were shown on the slide; see the sketch below).
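The projection formulas on this slide were images; minimal numpy sketches of the projections named here (the simplex projection is the standard sort-based algorithm):

    import numpy as np

    def project_nonnegative(v):
        return np.maximum(v, 0.0)            # clip negative entries to zero

    def soft_threshold(v, lam):
        # Sparsity: shrink every entry toward zero by lam (the proximal
        # operator of the L1 penalty); small entries become exactly zero.
        return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

    def project_unit_ball(v):
        n = np.linalg.norm(v)
        return v if n <= 1.0 else v / n      # rescale onto the L2 unit ball

    def project_simplex(v):
        # Euclidean projection onto {x : x >= 0, sum(x) = 1}, so the
        # vector becomes a probability distribution.
        u = np.sort(v)[::-1]
        css = np.cumsum(u)
        rho = np.nonzero(u - (css - 1.0) / np.arange(1, len(v) + 1) > 0)[0][-1]
        tau = (css[rho] - 1.0) / (rho + 1)
        return np.maximum(v - tau, 0.0)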

29 Dictionary Learning: learn a dictionary of concepts and a sparse reconstruction. Useful for fixing noise and missing pixels in images. Constraints: a sparse encoding, with dictionary elements kept within the unit ball.

30 Mixed Membership Network Decomposition: used for modeling communities in graphs (e.g. a social network). Constraints: simplex and non-negativity.

31 IMPLEMENTING ON HADOOP

32 High level algorithm (a Python rendering follows):
for epoch e = 1 … T do
  for subepoch s = 1 … d² do
    let Zs be the set of blocks in stratum s
    for block b = 1 … d in parallel do
      run SGD on all points in block Zs(b)
    end
  end
end
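A sequential Python sketch of that loop; strata, blocks, sgd_on_block and sync are placeholders for the real pieces, and in the actual system the inner loop runs in parallel, one block per reducer:

    def run(T, strata, blocks, sgd_on_block, sync):
        for epoch in range(T):
            for stratum in strata:        # d**2 strata per epoch
                for block_id in stratum:  # independent: parallel across reducers
                    sgd_on_block(blocks[block_id])
                sync()                    # barrier: exchange updated factors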

33 Bad Hadoop Algorithm: Subepoch 1 — one MapReduce job: mappers send blocks to reducers; each reducer runs SGD on its block and updates (U2, V1, W3), (U3, V2, W1), (U1, V3, W2).

34 Bad Hadoop Algorithm: Subepoch 2 — a fresh job: reducers now update (U2, V1, W2), (U3, V2, W3), (U1, V3, W1).

35 Bad Hadoop Algorithm: the same mapper-to-reducer pattern repeats for every subepoch, i.e. a new MapReduce job each time.

36 Hadoop Challenges: MapReduce is typically very bad for iterative algorithms
- T × d² iterations
- sizable overhead per Hadoop job
- little flexibility

37 High Level Algorithm: [diagram] the factors are split into blocks U1–U3, V1–V3, W1–W3; each reducer works on one block triple per subepoch.

38 [diagram] first stratum: (U1, V1, W1), (U2, V2, W2), (U3, V3, W3).

39 [diagram] next stratum: (U1, V1, W3), (U2, V2, W1), (U3, V3, W2).

40 [diagram] next stratum: (U1, V1, W2), (U2, V2, W3), (U3, V3, W1).

41 Hadoop Algorithm: mappers process points, mapping each point to its block with the info needed to order it; partition & sort routes points to reducers; each reducer runs SGD on its blocks — (U1, V1, W1), (U2, V2, W2), (U3, V3, W3) — and updates them, reading and writing parameters via HDFS.

42 Hadoop Algorithm: the partition & sort between mappers and reducers is implemented with a custom Partitioner, KeyComparator, and GroupingComparator.
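In Java MapReduce those are three small classes; a Python sketch of the logic each one implements, assuming the mapper emits a composite key (block_id, order_hint), where order_hint says when the block's stratum is scheduled — names are illustrative:

    def partition(key, num_reducers):
        # Partitioner: every point of a block must land on the same reducer.
        block_id, _order_hint = key
        return hash(block_id) % num_reducers

    def sort_key(key):
        # KeyComparator: within a reducer, order incoming points so blocks
        # arrive in the order their strata are processed.
        block_id, order_hint = key
        return (order_hint, block_id)

    def group_key(key):
        # GroupingComparator: all points sharing a block_id are fed to one
        # reduce() call, ignoring the ordering half of the key.
        block_id, _order_hint = key
        return block_id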

43 [animation] mappers process points, emitting each one keyed by its block with ordering info; partition & sort delivers them to reducers.

44 [animation] reducers run SGD on (U1, V1, W1), (U2, V2, W2), (U3, V3, W3) and update those blocks.

45 [animation] reducers keep running SGD block by block as the sorted points arrive.

46 [animation] between subepochs, reducers keep their U and V blocks and exchange the W blocks (W1, W2, W3) through HDFS.

47 [animation] reducers read their new W blocks from HDFS and run SGD on the next stratum's blocks.

48 Hadoop Summary
1. Use mappers to send data points to the correct reducers, in order
2. Use reducers as machines in a normal cluster
3. Use HDFS as the communication channel between reducers (a sketch of this follows)
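Point 3 in practice: a reducer can publish and fetch factor blocks with the standard HDFS shell commands; a minimal sketch, where the paths and file layout are hypothetical:

    import subprocess

    def publish_block(local_path, hdfs_path):
        # After a subepoch, write this reducer's updated factor block to
        # HDFS so the reducer that needs it next subepoch can pick it up.
        subprocess.run(["hadoop", "fs", "-put", "-f", local_path, hdfs_path],
                       check=True)

    def fetch_block(hdfs_path, local_path):
        subprocess.run(["hadoop", "fs", "-get", hdfs_path, local_path],
                       check=True)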

49 Distributed Normalization: in topic modeling (Documents × Words, through Topics), machine b holds parameter blocks πb (document–topic) and βb (topic–word).

50 Distributed Normalization: σ(b) is a k-dimensional vector summing the terms of βb, one entry per topic. Transfer every σ(b) to all machines; each machine then calculates the global σ = Σb σ(b) and normalizes its block: βb ← βb / σ.
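A simulated in-process sketch of that scheme, with beta_blocks[b] standing for machine b's (topics × local-words) slice of β:

    import numpy as np

    def normalize_topics(beta_blocks):
        # Each machine sums its own columns to get a local k-vector ...
        sigmas = [beta_b.sum(axis=1) for beta_b in beta_blocks]
        # ... "transfers sigma(b) to all machines", which here is just a sum ...
        sigma = np.sum(sigmas, axis=0)
        # ... and rescales its slice, so that afterwards each topic's
        # weights sum to 1 across all machines combined.
        return [beta_b / sigma[:, None] for beta_b in beta_blocks]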

51 Barriers & Stragglers: [same pipeline diagram] reducers that finish SGD on their block early sit at the synchronization barrier, wasting time waiting for slower machines!

52 Solution: "Always-On SGD". For each reducer (sketch below):
- Run SGD on all points in the current block Z
- Shuffle the points in Z and decrease the step size
- Check if the other reducers are ready to sync
- If not ready to sync: run SGD on the points in Z again (instead of waiting)
- When ready: sync parameters and get a new block Z
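A sketch of one reducer's loop; others_ready, sync and sgd_pass are placeholders for the HDFS-based signalling and the actual SGD updates, and the 0.9 decay rate is illustrative:

    import random

    def always_on_reducer(Z, params, step, others_ready, sync, sgd_pass):
        sgd_pass(Z, params, step)      # the required pass over block Z
        while not others_ready():      # instead of idling at the barrier...
            random.shuffle(Z)          # reshuffle the same points
            step *= 0.9                # decrease the step size
            sgd_pass(Z, params, step)  # ...do extra updates on old data
        return sync(params)            # exchange parameters, get new block Z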

53 "Always-On SGD": [same pipeline diagram] while waiting for the others to finish, each reducer runs SGD on its old points again!

54 "Always-On SGD" [timeline across Reducers 1–4: first SGD pass of block Z, extra SGD updates, read parameters from HDFS, write parameters to HDFS]

55 EXPERIMENTS

56 FlexiFaCT (Tensor Decomposition) Convergence

57 FlexiFaCT (Tensor Decomposition) Scalability in Data Size

58 FlexiFaCT (Tensor Decomposition) Scalability in Tensor Dimension Handles up to 2 billion parameters!

59 FlexiFaCT (Tensor Decomposition) Scalability in Rank of Decomposition Handles up to 4 billion parameters!

60 FlexiFaCT (Tensor Decomposition) Scalability in Number of Machines

61 Fugue (Using “Always-On SGD”) Dictionary Learning: Convergence

62 Fugue (Using “Always-On SGD”) Community Detection: Convergence

63 Fugue (Using “Always-On SGD”) Topic Modeling: Convergence

64 Fugue (Using “Always-On SGD”) Topic Modeling: Scalability in Data Size

65 Fugue (Using “Always-On SGD”) Topic Modeling: Scalability in Rank

66 Fugue (Using “Always-On SGD”) Topic Modeling: Scalability over Machines

67 Fugue (Using “Always-On SGD”) Topic Modeling: Number of Machines

68 Fugue (Using “Always-On SGD”)

69 Key Points
- Flexible method for tensors & ML models
- Can use stock Hadoop by using HDFS for communication
- When waiting for slower machines, run updates on old data again

70 Questions? Alex Beutel abeutel@cs.cmu.edu http://alexbeutel.com

