Learning with Hadoop – A case study on MapReduce based Data Mining
Evan Xiang, HKUST
Outline
- Hadoop Basics
- Case Study
  - Word Count
  - Pairwise Similarity
  - PageRank
  - K-Means Clustering
  - Matrix Factorization
  - Cluster Coefficient
- Resource Entries to ML labs
- Advanced Topics
- Q&A

Introduction to Hadoop
Hadoop MapReduce is a Java-based software framework for easily writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware, in a reliable, fault-tolerant manner.

Hadoop Cluster Architecture (from Jimmy Lin's slides)
[Figure: a client talks to the job submission node, which runs the JobTracker, and to the HDFS master, which runs the NameNode; each slave node runs a TaskTracker and a DataNode.]

Hadoop HDFS

Hadoop Cluster Rack Awareness

Hadoop Development Cycle (from Jimmy Lin's slides)
1. Scp data to the cluster
2. Move data into HDFS
3. Develop code locally
4. Submit the MapReduce job (4a. go back to Step 3)
5. Move data out of HDFS
6. Scp data from the cluster

Divide and Conquer (from Jimmy Lin's slides)
[Figure: the "Work" is partitioned into parts w1, w2, w3; a "worker" turns each part into a result r1, r2, r3; the results are combined into the final "Result".]

High-level MapReduce pipeline

Detailed Hadoop MapReduce data flow

Word Count with MapReduce (from Jimmy Lin's slides)
[Figure: Doc 1 "one fish, two fish", Doc 2 "red fish, blue fish", Doc 3 "cat in the hat". Map emits (word, 1) for every word occurrence; shuffle and sort aggregate the values by key; reduce sums the counts per word, e.g. (fish, 4), (one, 1), (two, 1), (red, 1), (blue, 1), (cat, 1), (hat, 1).]
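In code, this dataflow is the classic Hadoop word count job. The sketch below is a minimal version using the standard org.apache.hadoop.mapreduce API (class names are illustrative; the slides show only the dataflow, not code):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: for each word in the input line, emit (word, 1).
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the 1s for each word after the shuffle-and-sort phase.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Reusing the reducer as a combiner gives local aggregation on each map node before the shuffle, which cuts the volume of intermediate (word, 1) pairs.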

Calculating document pairwise similarity (from Jimmy Lin's slides)
- Trivial solution: load each document vector O(N) times, and load each term O(df_t^2) times.
- Goal: a scalable and efficient solution for large collections.

Better Solution (from Jimmy Lin's slides)
- Load the weights for each term only once.
- Each term contributes O(df_t^2) partial scores; a term contributes to a pair's similarity only if it appears in both documents.

Decomposition (from Jimmy Lin's slides)
The same idea decomposes naturally onto MapReduce: the map phase loads the weights for each term once and emits its O(df_t^2) partial scores (a term contributes only if it appears in both documents), and the reduce phase sums the partial scores for each document pair.
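In symbols (a reconstruction, since the slide's formula did not survive the transcript), with w_{t,d} the weight of term t in document d:

```latex
\mathrm{sim}(d_i, d_j) \;=\; \sum_{t \,\in\, d_i \cap d_j} w_{t,d_i}\, w_{t,d_j}
```

Each term t contributes one partial product w_{t,d_i} w_{t,d_j} per pair of documents in its postings list; the reducer sums them per pair (d_i, d_j).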

Standard Indexing (from Jimmy Lin's slides)
[Figure: (a) Map — each document is tokenized; (b) Shuffle — values are grouped by term; (c) Reduce — the values for each term are combined into its posting list.]

Inverted Indexing with MapReduce (from Jimmy Lin's slides)
[Figure: the same three documents as before. Map emits (term, (docid, term frequency)) pairs; shuffle and sort aggregate the values by key; reduce produces the posting list for each term, e.g. fish → (1, 2), (2, 2); one → (1, 1); two → (1, 1); red → (2, 1); blue → (2, 1); cat → (3, 1); hat → (3, 1).]

Indexing a 3-doc toy collection (from Jimmy Lin's slides)
[Figure: three toy documents over the terms Clinton, Barack, Cheney, Obama are indexed into per-term posting lists of (document, weight) pairs.]

Pairwise Similarity (from Jimmy Lin's slides)
[Figure: (a) generate pairs — for each term's posting list, emit a partial score for every pair of documents in it; (b) group pairs by document pair; (c) sum the partial scores into the final similarities over Clinton, Barack, Cheney, Obama. Open question: how to deal with the long posting list?]
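A minimal sketch of the generate/group/sum pipeline, assuming the index from the previous step stores each term's postings on one line as "term docId:weight docId:weight …" (the record format and all class names are assumptions, not from the slides):

```java
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PairwiseSimilarity {

  // (a) Generate pairs: each term contributes one partial score
  // per pair of documents in its postings list.
  public static class PairMapper extends Mapper<Object, Text, Text, DoubleWritable> {
    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\\s+");
      // fields[0] is the term; the rest are docId:weight postings.
      for (int i = 1; i < fields.length; i++) {
        String[] pi = fields[i].split(":");
        for (int j = i + 1; j < fields.length; j++) {
          String[] pj = fields[j].split(":");
          double partial = Double.parseDouble(pi[1]) * Double.parseDouble(pj[1]);
          // (b) The shuffle groups the partial scores by document pair.
          context.write(new Text(pi[0] + "," + pj[0]), new DoubleWritable(partial));
        }
      }
    }
  }

  // (c) Sum pairs: add up the partial scores for each document pair.
  public static class SumReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    public void reduce(Text pair, Iterable<DoubleWritable> partials, Context context)
        throws IOException, InterruptedException {
      double sim = 0.0;
      for (DoubleWritable p : partials) sim += p.get();
      context.write(pair, new DoubleWritable(sim));
    }
  }
}
```

This also makes the slide's "long list" question concrete: a term whose postings list has df_t entries emits df_t(df_t−1)/2 pairs, which is why very frequent terms are often pruned.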

PageRank
PageRank is an information propagation model over the link graph; computing it requires intensive access to each node's neighborhood list.

PageRank with MapReduce (from Jimmy Lin's slides)
[Figure: each node carries its adjacency list — n_1 [n_2, n_4], n_2 [n_3, n_5], n_3 [n_4], n_4 [n_5], n_5 [n_1, n_2, n_3]. Map distributes each node's PageRank mass to its out-neighbors; reduce sums the incoming mass at each node. Open question: how to maintain the graph structure across iterations?]
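A minimal sketch of one iteration, assuming each record is "nodeId <tab> rank <tab> comma-separated adjacency list" (the record format and class names are illustrative). The mapper answers the slide's question by re-emitting each node's adjacency list, so the graph structure survives the shuffle and the reducer can write records in the same format for the next iteration:

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PageRankIteration {

  public static class PRMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split("\t");
      String node = parts[0];
      double rank = Double.parseDouble(parts[1]);
      String[] neighbors = parts.length > 2 ? parts[2].split(",") : new String[0];
      // Pass the graph structure along to the reducer.
      context.write(new Text(node),
          new Text("GRAPH\t" + (parts.length > 2 ? parts[2] : "")));
      // Distribute this node's rank mass evenly over its out-links.
      for (String n : neighbors) {
        context.write(new Text(n), new Text("MASS\t" + rank / neighbors.length));
      }
    }
  }

  public static class PRReducer extends Reducer<Text, Text, Text, Text> {
    private static final double D = 0.85;  // damping factor

    @Override
    public void reduce(Text node, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      String adjacency = "";
      double mass = 0.0;
      for (Text v : values) {
        String[] parts = v.toString().split("\t", 2);
        if (parts[0].equals("GRAPH")) adjacency = parts.length > 1 ? parts[1] : "";
        else mass += Double.parseDouble(parts[1]);
      }
      double rank = (1 - D) + D * mass;  // simplified per-node damping
      context.write(node, new Text(rank + "\t" + adjacency));
    }
  }
}
```

For brevity the sketch uses the simplified per-node damping formula and ignores the mass lost at dangling nodes; a faithful implementation would redistribute that mass.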

K-Means Clustering

K-Means Clustering with MapReduce
[Figure: the data is split across mappers (Mapper_i-1, Mapper_i, Mapper_i+1), whose outputs are shuffled to reducers (Reducer_i-1, Reducer_i, Reducer_i+1).]
Each mapper loads a set of data samples and assigns each sample to the nearest centroid, so each mapper needs to keep a copy of the current centroids. How the initial centroids are set is very important! Usually the centroids are seeded using Canopy Clustering [McCallum, Nigam and Ungar: "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", SIGKDD 2000].
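A minimal sketch of one k-means iteration under these constraints. Here the driver is assumed to pass the current centroids to every mapper through the job configuration (a distributed-cache file is the more common choice for large k); record formats and class names are illustrative:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class KMeansIteration {

  public static class AssignMapper extends Mapper<Object, Text, IntWritable, Text> {
    private final List<double[]> centroids = new ArrayList<>();

    @Override
    protected void setup(Context context) {
      // Assumption: the driver serialized the current centroids into the
      // job configuration as "x1,y1;x2,y2;...". Each mapper keeps its copy.
      String conf = context.getConfiguration().get("kmeans.centroids", "");
      for (String c : conf.split(";")) {
        if (!c.isEmpty()) centroids.add(parse(c));
      }
    }

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      double[] x = parse(value.toString());
      int best = 0;
      double bestDist = Double.MAX_VALUE;
      for (int c = 0; c < centroids.size(); c++) {
        double d = squaredDistance(x, centroids.get(c));
        if (d < bestDist) { bestDist = d; best = c; }
      }
      // Emit (nearest centroid id, sample); the shuffle groups by cluster.
      context.write(new IntWritable(best), value);
    }

    private static double[] parse(String line) {
      String[] f = line.split(",");
      double[] x = new double[f.length];
      for (int i = 0; i < f.length; i++) x[i] = Double.parseDouble(f[i]);
      return x;
    }

    private static double squaredDistance(double[] a, double[] b) {
      double s = 0;
      for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
      return s;
    }
  }

  // Each reducer recomputes one cluster's centroid as the mean of its samples.
  public static class RecenterReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    public void reduce(IntWritable cluster, Iterable<Text> samples, Context context)
        throws IOException, InterruptedException {
      double[] sum = null;
      long n = 0;
      for (Text t : samples) {
        String[] f = t.toString().split(",");
        if (sum == null) sum = new double[f.length];
        for (int i = 0; i < f.length; i++) sum[i] += Double.parseDouble(f[i]);
        n++;
      }
      StringBuilder out = new StringBuilder();
      for (int i = 0; i < sum.length; i++) {
        if (i > 0) out.append(",");
        out.append(sum[i] / n);
      }
      context.write(cluster, new Text(out.toString()));
    }
  }
}
```

The driver runs this job repeatedly, feeding each iteration's reducer output back in as the next set of centroids until they stop moving.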

Matrix Factorization for Link Prediction
In this task, we observe a sparse matrix X ∈ R^{m×n} with entries x_ij. Let R = {(i, j, r) : r = x_ij, x_ij ≠ 0} denote the set of observed links in the system. To predict the unobserved links in X, we model the users and the items by a user factor matrix U ∈ R^{k×m} and an item factor matrix V ∈ R^{k×n}. The goal is to approximate the link matrix X by the product of the factor matrices U and V, which can be learned by minimizing the objective below.
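The objective itself did not survive the transcript; assuming the standard regularized squared-error formulation, with u_i and v_j the columns of U and V and λ a regularization weight, it is presumably:

```latex
\min_{U, V} \; \sum_{(i, j, x_{ij}) \in R} \bigl( x_{ij} - \mathbf{u}_i^{\top} \mathbf{v}_j \bigr)^2
  \;+\; \lambda \bigl( \lVert U \rVert_F^2 + \lVert V \rVert_F^2 \bigr)
```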

Solving Matrix Factorization via Alternating Least Squares (ALS)
Given X and V, update U; similarly, given X and U, we can alternately update V.
[Figure: updating one user factor u_i involves the i-th row of X (1×n), the item factors V (k×n), a k×k matrix A, and a k×1 vector b.]
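The A (k×k) and b (k×1) in the figure suggest the usual closed-form ridge-regression step. As a reconstruction, with R_i = {j : x_ij ≠ 0} the set of items observed for user i:

```latex
A_i = \sum_{j \in R_i} \mathbf{v}_j \mathbf{v}_j^{\top} + \lambda I_k,
\qquad
\mathbf{b}_i = \sum_{j \in R_i} x_{ij}\, \mathbf{v}_j,
\qquad
\mathbf{u}_i = A_i^{-1} \mathbf{b}_i
```

The update for v_j given U is symmetric, and the two updates alternate until convergence.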

MapReduce for ALS
[Figure: two stages. Stage 1 — mappers group the rating data in X by item j and group the feature vectors in V by item j; reducers align the ratings and features for item j and make a copy of V_j for each observed x_ij. Stage 2 — mappers regroup the (rating, V_j) pairs by user i; reducers run the standard ALS step: calculate A and b, and update U_i.]

Cluster Coefficient
In graph mining, the clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster together. The local clustering coefficient of a vertex quantifies how close its neighbors are to being a clique (complete graph), and is used to determine whether a graph is a small-world network [D. J. Watts and Steven Strogatz (June 1998). "Collective dynamics of 'small-world' networks". Nature 393 (6684): 440–442]. Open question: how to maintain the tier-2 neighbors?
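Concretely (Watts and Strogatz's definition for an undirected graph), the local clustering coefficient of a vertex i with neighbor set N_i and degree k_i is:

```latex
C_i \;=\; \frac{2 \,\bigl| \{ (u, v) \in E : u, v \in N_i \} \bigr|}{k_i \,(k_i - 1)}
```

so computing C_i requires the edges among i's neighbors, i.e. its tier-2 neighborhood.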

Cluster Coefficient with MapReduce
[Figure: two stages of mappers and reducers; the second stage calculates the cluster coefficient.] A BFS-based method needs three stages, but actually we only need two!
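One way to realize the two-stage idea, sketched below under the assumption that the first stage has already turned the raw edge list into adjacency lists ("node <tab> comma-separated neighbors"). In the second stage each node's list is shipped to all of its neighbors, so every reducer sees its node's tier-2 neighborhood without any BFS; class names and record formats are illustrative, not from the slides:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ClusterCoefficient {

  // Ship each node's neighbor list to the node itself and to every neighbor.
  public static class NeighborListMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split("\t");
      String node = parts[0], list = parts[1];
      context.write(new Text(node), new Text("SELF\t" + list));
      for (String v : list.split(",")) {
        context.write(new Text(v), new Text("NBR\t" + list));
      }
    }
  }

  public static class CoefficientReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text node, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      Set<String> own = new HashSet<>();
      List<String> neighborLists = new ArrayList<>();
      for (Text v : values) {
        String[] parts = v.toString().split("\t");
        if (parts[0].equals("SELF")) own.addAll(Arrays.asList(parts[1].split(",")));
        else neighborLists.add(parts[1]);
      }
      // For each neighbor u, count how many of u's neighbors are also
      // neighbors of this node; every edge among neighbors is counted
      // twice, matching the 2x in the clustering-coefficient formula.
      long links = 0;
      for (String list : neighborLists) {
        for (String w : list.split(",")) {
          if (own.contains(w)) links++;
        }
      }
      long k = own.size();
      double c = k > 1 ? (double) links / (k * (k - 1)) : 0.0;
      context.write(node, new Text(Double.toString(c)));
    }
  }
}
```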

Resource Entries to ML labs
- Mahout: Apache's scalable machine learning libraries
- Jimmy Lin's Lab: iSchool at the University of Maryland
- Jimeng Sun & Yan Rong's collections: IBM T.J. Watson Research Center
- Edward Chang & Yi Wang: Google Beijing

Advanced Topics in Machine Learning with MapReduce
- Probabilistic graphical models
- Gradient-based optimization methods
- Graph mining
- Others…

Some Advanced Tips
- Design your algorithm in a divide-and-conquer manner
- Make your functional units loosely coupled
- Carefully manage your memory and disk storage
- Discussions…

Q&A
- Why not MPI? Hadoop is cheap in everything… D.P.T.H…
- What are the advantages of Hadoop? Scalability!
- How do you guarantee model equivalence? Guarantee equivalent/comparable function logic.
- How can you beat a "large memory" solution? Clever use of sequential disk access.