
1 Learning with Hadoop – A Case Study on MapReduce-Based Data Mining
Evan Xiang, HKUST

2 Outline
 Hadoop Basics
 Case Study
   Word Count
   Pairwise Similarity
   PageRank
   K-Means Clustering
   Matrix Factorization
   Cluster Coefficient
 Resource Entries to ML labs
 Advanced Topics
 Q&A

3 Introduction to Hadoop
 Hadoop MapReduce is a Java-based software framework for easily writing applications
 which process vast amounts of data (multi-terabyte datasets) in parallel on large clusters (thousands of nodes) of commodity hardware
 in a reliable, fault-tolerant manner.

4 Hadoop Cluster Architecture
Client → Job submission node (JobTracker + NameNode, the HDFS master)
Slave nodes: each runs a TaskTracker and a DataNode
From Jimmy Lin’s slides

5 Hadoop HDFS

6 Hadoop Cluster Rack Awareness

7 Hadoop Development Cycle (You ↔ Hadoop Cluster)
1. Scp data to cluster
2. Move data into HDFS
3. Develop code locally
4. Submit MapReduce job
4a. Go back to Step 3
5. Move data out of HDFS
6. Scp data from cluster
From Jimmy Lin’s slides

8 Divide and Conquer
The “Work” is partitioned into pieces w1, w2, w3; a “worker” processes each piece into a partial result r1, r2, r3; the partial results are combined into the final “Result”.
From Jimmy Lin’s slides

9 High-level MapReduce pipeline

10 Detailed Hadoop MapReduce data flow

12 Word Count with MapReduce
Map:
  Doc 1 “one fish, two fish” → (one, 1), (two, 1), (fish, 2)
  Doc 2 “red fish, blue fish” → (red, 1), (blue, 1), (fish, 2)
  Doc 3 “cat in the hat” → (cat, 1), (hat, 1)
Shuffle and Sort: aggregate values by keys
Reduce: (fish, 4), (one, 1), (two, 1), (red, 1), (cat, 1), (blue, 1), (hat, 1)
From Jimmy Lin’s slides
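The word-count pipeline on slide 12 can be sketched as a minimal Python simulation of the MapReduce logic (plain functions standing in for mappers and reducers, not the Hadoop API; the stopword set is an assumption, added only to match the slide’s output for Doc 3):

```python
from collections import defaultdict

STOPWORDS = {"in", "the"}  # assumed: the slide drops these from "cat in the hat"

def map_word_count(text):
    """Mapper with in-mapper combining: emit (term, count) once per document."""
    counts = defaultdict(int)
    for term in text.replace(",", "").split():
        if term not in STOPWORDS:
            counts[term] += 1
    return counts.items()

def word_count(docs):
    grouped = defaultdict(list)          # shuffle and sort: group values by key
    for text in docs.values():
        for term, n in map_word_count(text):
            grouped[term].append(n)
    return {t: sum(ns) for t, ns in grouped.items()}   # reduce: sum partial counts

docs = {1: "one fish, two fish", 2: "red fish, blue fish", 3: "cat in the hat"}
print(word_count(docs))  # fish appears 4 times in total; every other term once
```

Emitting per-document counts from the mapper (rather than one `(term, 1)` pair per token) mirrors the slide, where Doc 1 already emits (fish, 2) before the shuffle.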

14 Calculating Document Pairwise Similarity
 Trivial solution: load each vector o(N) times; load each term o(df_t²) times
 Goal: a scalable and efficient solution for large collections
From Jimmy Lin’s slides

15 Better Solution
 Load weights for each term once
 Each term contributes o(df_t²) partial scores; a term contributes only if it appears in both documents
From Jimmy Lin’s slides

16 Decomposition
 Load weights for each term once
 Each term contributes o(df_t²) partial scores, generated in the map phase and summed in the reduce phase; a term contributes only if it appears in both documents
From Jimmy Lin’s slides

17 Standard Indexing
(a) Map: tokenize each doc
(b) Shuffle: group values by terms
(c) Reduce: combine each term’s entries into a posting list
From Jimmy Lin’s slides

18 Inverted Indexing with MapReduce
Map:
  Doc 1 “one fish, two fish” → fish → (1, 2), one → (1, 1), two → (1, 1)
  Doc 2 “red fish, blue fish” → fish → (2, 2), red → (2, 1), blue → (2, 1)
  Doc 3 “cat in the hat” → cat → (3, 1), hat → (3, 1)
Shuffle and Sort: aggregate values by keys
Reduce: fish → [(1, 2), (2, 2)], one → [(1, 1)], two → [(1, 1)], red → [(2, 1)], cat → [(3, 1)], blue → [(2, 1)], hat → [(3, 1)]
From Jimmy Lin’s slides
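The inverted index on slide 18 follows the same map/shuffle/reduce skeleton as word count, except that the mapper keys on the term and the value is a (doc_id, term_frequency) posting. A minimal Python simulation (not the Hadoop API; the stopword set is an assumption to match the slide):

```python
from collections import defaultdict

STOPWORDS = {"in", "the"}  # assumed: the slide drops these

def build_index(docs):
    """Map: emit (term, (doc_id, tf)) per document; shuffle groups by term;
    reduce sorts each term's postings by doc_id."""
    postings = defaultdict(list)
    for doc_id, text in docs.items():
        tf = defaultdict(int)
        for term in text.replace(",", "").split():
            if term not in STOPWORDS:
                tf[term] += 1
        for term, n in tf.items():
            postings[term].append((doc_id, n))
    return {t: sorted(p) for t, p in postings.items()}

docs = {1: "one fish, two fish", 2: "red fish, blue fish", 3: "cat in the hat"}
index = build_index(docs)
print(index["fish"])  # [(1, 2), (2, 2)]
```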

19 Indexing (3-doc toy collection)
Doc 1: “Clinton Obama Clinton”; Doc 2: “Clinton Cheney”; Doc 3: “Clinton Barack Obama”
Index: Clinton → [(1, 2), (2, 1), (3, 1)]; Obama → [(1, 1), (3, 1)]; Cheney → [(2, 1)]; Barack → [(3, 1)]
From Jimmy Lin’s slides

20 Pairwise Similarity
(a) Generate pairs: for each term’s posting list, emit a partial score for every pair of documents in it
(b) Group pairs
(c) Sum pairs
How to deal with the long posting list of a frequent term?
From Jimmy Lin’s slides
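The generate/group/sum steps can be sketched directly from an inverted index: each posting carries a (doc_id, weight) pair, and every pair of documents sharing a term accumulates the product of their weights (a minimal simulation, not the Hadoop API):

```python
from collections import defaultdict
from itertools import combinations

def pairwise_similarity(index):
    """(a) For each term's posting list, generate a partial score for every
    document pair; (b) group partials by pair; (c) sum them."""
    partial = defaultdict(list)
    for term, postings in index.items():
        for (d1, w1), (d2, w2) in combinations(postings, 2):
            partial[(d1, d2)].append(w1 * w2)       # one partial score per shared term
    return {pair: sum(ws) for pair, ws in partial.items()}

# Postings from the word-count toy collection: fish is the only shared term.
index = {"fish": [(1, 2), (2, 2)], "one": [(1, 1)], "red": [(2, 1)]}
print(pairwise_similarity(index))  # {(1, 2): 4}
```

A term appearing in df_t documents generates df_t·(df_t − 1)/2 pairs here, which is exactly why the slide asks how to handle long posting lists of frequent terms.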

22 PageRank
 PageRank – an information propagation model
 Requires intensive access of each node’s neighborhood list

23 PageRank with MapReduce
Adjacency lists: n1 [n2, n4]; n2 [n3, n5]; n3 [n4]; n4 [n5]; n5 [n1, n2, n3]
Map: each node distributes its rank mass evenly among its out-neighbors
Reduce: each node sums its incoming mass to obtain its new rank
How to maintain the graph structure?
From Jimmy Lin’s slides
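One iteration of this scheme, on the slide’s five-node graph, can be simulated as follows (a simplified sketch with no damping factor; the answer to the slide’s question is visible in the mapper, which re-emits each node’s adjacency list alongside the rank messages so the graph structure survives the pass):

```python
from collections import defaultdict

def pagerank_iteration(graph, ranks):
    """One MapReduce pass.
    Map: each node sends rank/out-degree to every out-neighbor and also
    re-emits its adjacency list (maintaining the graph structure).
    Reduce: each node sums its incoming mass to get its new rank."""
    messages = defaultdict(float)
    structure = {}
    for node, neighbors in graph.items():
        structure[node] = neighbors                  # graph structure passed along
        for n in neighbors:
            messages[n] += ranks[node] / len(neighbors)
    new_ranks = {node: messages[node] for node in graph}
    return structure, new_ranks

# The slide's graph.
graph = {"n1": ["n2", "n4"], "n2": ["n3", "n5"], "n3": ["n4"],
         "n4": ["n5"], "n5": ["n1", "n2", "n3"]}
ranks = {n: 0.2 for n in graph}
graph, ranks = pagerank_iteration(graph, ranks)
# n4 receives 0.2/2 from n1 and 0.2 from n3, so its new rank is 0.3
```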

25 K-Means Clustering

26 K-Means Clustering with MapReduce
Each mapper loads a set of data samples and assigns each sample to a nearest centroid
Each mapper needs to keep a copy of the centroids
How to set the initial centroids is very important! Usually we set the centroids using Canopy Clustering.
[McCallum, Nigam and Ungar: "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", SIGKDD 2000]
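A single iteration of the scheme above, sketched on 1-D samples (an assumption for brevity; real data would be vectors, and the per-cluster averages would be computed across many mappers and reducers rather than in one function):

```python
def kmeans_iteration(samples, centroids):
    """One MapReduce pass over 1-D samples.
    Map: assign each sample to its nearest centroid (each mapper holds a
    full copy of the centroids).  Reduce: average each cluster's samples."""
    assignments = [min(range(len(centroids)), key=lambda c: abs(x - centroids[c]))
                   for x in samples]
    new_centroids = []
    for c in range(len(centroids)):
        members = [x for x, a in zip(samples, assignments) if a == c]
        # keep the old centroid if a cluster receives no samples
        new_centroids.append(sum(members) / len(members) if members else centroids[c])
    return new_centroids

samples = [1.0, 1.2, 0.8, 9.0, 9.5, 10.0]
print(kmeans_iteration(samples, [0.0, 5.0]))  # centroids move toward the two groups
```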

28 Matrix Factorization for Link Prediction
 In this task, we observe a sparse matrix X ∈ R^{m×n} with entries x_ij. Let R = {(i, j, r) : r = x_ij, x_ij ≠ 0} denote the set of observed links in the system. To predict the unobserved links in X, we model the users and the items by a user factor matrix U ∈ R^{k×m} and an item factor matrix V ∈ R^{k×n}. The goal is to approximate the link matrix X by the product of the factor matrices U and V, which can be learned by minimizing a regularized squared-error objective.
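The objective itself did not survive the transcript; a standard regularized squared-error form, assuming λ is the regularization weight and u_i, v_j are the columns of U and V, is:

```latex
\min_{U,V}\ \sum_{(i,j,r)\in R}\left(x_{ij}-\mathbf{u}_i^{\top}\mathbf{v}_j\right)^2
\;+\;\lambda\left(\lVert U\rVert_F^2+\lVert V\rVert_F^2\right)
```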

29 Solving Matrix Factorization via Alternating Least Squares
 Given X and V, update U: for each user i, solve a k×k linear system A u_i = b
 Similarly, given X and U, we can alternately update V
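The k×k matrix A and k-vector b in the figure are consistent with the normal equations of the regularized least-squares problem for u_i (assuming R_i denotes the set of items with observed links for user i and λ the regularization weight):

```latex
A=\sum_{j\in R_i}\mathbf{v}_j\mathbf{v}_j^{\top}+\lambda I_k,\qquad
\mathbf{b}=\sum_{j\in R_i}x_{ij}\,\mathbf{v}_j,\qquad
\mathbf{u}_i=A^{-1}\mathbf{b}
```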

30 MapReduce for ALS
Stage 1: group the rating data in X by item j; group the features in V by item j; align the ratings and features for item j, and make a copy of V_j for each observed x_ij
Stage 2: group the rating data in X by user i; then run standard ALS: calculate A and b, and update U_i

32 Cluster Coefficient
 In graph mining, the clustering coefficient measures the degree to which nodes in a graph tend to cluster together. The local clustering coefficient of a vertex quantifies how close its neighbors are to being a clique (complete graph), and is used to determine whether a graph is a small-world network.
How to maintain the Tier-2 (two-hop) neighbors?
[D. J. Watts and Steven Strogatz (June 1998). "Collective dynamics of 'small-world' networks". Nature 393 (6684): 440–442]
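The local clustering coefficient described above can be computed directly from an adjacency structure; this is a serial sketch of what each reducer would evaluate once a vertex’s neighborhood (including the Tier-2 links among its neighbors) has been gathered:

```python
from itertools import combinations

def local_clustering_coefficient(adj, v):
    """C(v) = existing links among v's neighbors / (k * (k - 1) / 2),
    where k is the degree of v."""
    neighbors = adj[v]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(sorted(neighbors), 2) if b in adj[a])
    return links / (k * (k - 1) / 2)

# Toy graph: a triangle a-b-c plus a pendant node d attached to a.
adj = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
print(local_clustering_coefficient(adj, "a"))  # 1 of 3 possible neighbor links exists
```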

33 Cluster Coefficient with MapReduce
Two MapReduce stages: Stage 1 gathers each vertex’s neighborhood information; Stage 2 calculates the cluster coefficient
A BFS-based method needs three stages, but actually we only need two!

34 Resource Entries to ML labs
 Mahout – Apache’s scalable machine learning libraries
 Jimmy Lin’s lab – iSchool at the University of Maryland
 Jimeng Sun & Yan Rong’s collections – IBM TJ Watson Research Center
 Edward Chang & Yi Wang – Google Beijing

35 Advanced Topics in Machine Learning with MapReduce
 Probabilistic graphical models
 Gradient-based optimization methods
 Graph mining
 Others…

36 Some Advanced Tips
 Design your algorithm in a divide-and-conquer manner
 Make your functional units loosely dependent
 Carefully manage your memory and disk storage
 Discussions…

38 Q&A
 Why not MPI? Hadoop is cheap in everything… D.P.T.H…
 What are the advantages of Hadoop? Scalability!
 How do you guarantee the model equivalence? Guarantee equivalent/comparable function logic
 How can you beat a "large memory" solution? Clever use of sequential disk access

