





1 Tutorial for MapReduce (Hadoop) & Large Scale Processing
Le Zhao (LTI, SCS, CMU)
Database Seminar & Large Scale Seminar, 2010-Feb-15
Some slides adapted from IR course lectures by Jamie Callan
© 2010, Le Zhao 1

2 Outline
Why MapReduce (Hadoop)
MapReduce basics
The MapReduce way of thinking
Manipulating large data
© 2010, Le Zhao 2

3 Outline
Why MapReduce (Hadoop)
–Why go large scale
–Compared to other parallel computing models
–Hadoop related tools
MapReduce basics
The MapReduce way of thinking
Manipulating large data
© 2010, Le Zhao 3

4 Why NOT to do parallel computing yourself
Concerns: a parallel system needs to provide:
–Data distribution
–Computation distribution
–Fault tolerance
–Job scheduling
© 2010, Le Zhao 4

5 Why MapReduce (Hadoop)
Previous parallel computation models:
–1) scp + ssh
»Manual everything
–2) network cross-mounted disks + condor/torque
»No data distribution; disk access is the bottleneck
»Can only partition totally distributive computations (no distributed aggregation)
»No fault tolerance
»Prioritized job scheduling
© 2010, Le Zhao 5

6 Hadoop
Parallel batch computation:
–Data distribution
»Hadoop Distributed File System (HDFS)
»Like the Linux FS, but with automatic data replication
–Computation distribution
»Automatic; users only need to specify #input_splits
»Can distribute aggregation computations as well
–Fault tolerance
»Automatic recovery from failure
»Speculative execution (a backup task)
–Job scheduling
»OK, but still relies on the politeness of users
© 2010, Le Zhao 6

7 How you can use Hadoop
Hadoop Streaming
–Quick hacking – much like shell scripting
»Uses STDIN & STDOUT to carry data
»cat file | mapper | sort | reducer > output
–Easier to use legacy code; works with all programming languages
Hadoop Java API
–Build large systems
»More data types
»More control over Hadoop's behavior
»Easier debugging with Java's error stacktrace display
–NetBeans plugin for Hadoop provides easy programming
»http://hadoopstudio.org/docs.html
© 2010, Le Zhao 7
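The streaming pipeline above can be sketched in-process. This is a minimal illustration, not real Hadoop Streaming: in a real job, mapper and reducer would be separate scripts reading sys.stdin line by line, with Hadoop supplying the sort between them.

```python
# Simulate "cat file | mapper | sort | reducer > output" for word counting.
from itertools import groupby

def mapper(lines):
    """Emit (word, 1) for every word, like a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reducer(sorted_pairs):
    """Sum counts per word; assumes input arrives sorted by key, as after 'sort'."""
    for word, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield (word, sum(v for _, v in group))

def run(lines):
    # sorted() plays the role of the shell's "sort" between mapper and reducer
    return dict(reducer(sorted(mapper(lines))))

# run(["a b a", "b a"]) -> {"a": 3, "b": 2}
```

The same two functions, wrapped to read STDIN and print tab-separated pairs, would run unchanged under Hadoop Streaming.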

8 Outline Why MapReduce (Hadoop) MapReduce basics The MapReduce way of thinking Manipulating large data © 2010, Le Zhao 8

9 Map and Reduce
MapReduce is a new use of an old idea in Computer Science
Map: apply a function to every object in a list
–Each object is independent
»Order is unimportant
»Maps can be done in parallel
–The function produces a result
Reduce: combine the results to produce a final result
You may have seen this in a Lisp or functional programming course
© 2009, Jamie Callan 9
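The Lisp/functional-programming idea the slide refers to is visible in Python's built-ins, shown here purely as a one-liner illustration of the concept:

```python
from functools import reduce

nums = [1, 2, 3, 4]
# Map: apply a function independently to every element (order doesn't matter)
squares = list(map(lambda x: x * x, nums))
# Reduce: combine all the per-element results into one final result
total = reduce(lambda a, b: a + b, squares)
# squares == [1, 4, 9, 16], total == 30
```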

10 MapReduce
Input reader
–Divide input into splits; assign each split to a Map processor
Map
–Apply the Map function to each record in the split
–Each Map function returns a list of (key, value) pairs
Shuffle/Partition and Sort
–Shuffle distributes sorting & aggregation to many reducers
–All records for key k are directed to the same reduce processor
–Sort groups the same keys together and prepares for aggregation
Reduce
–Apply the Reduce function to each key
–The result of the Reduce function is a list of (key, value) pairs
© 2010, Jamie Callan 10

11 MapReduce in One Picture
[Figure from Tom White, Hadoop: The Definitive Guide]
© 2010, Le Zhao 11

12 Outline
Why MapReduce (Hadoop)
MapReduce basics
The MapReduce way of thinking
–Two simple use cases
–Two more advanced & useful MapReduce tricks
–Two MapReduce applications
Manipulating large data
© 2010, Le Zhao 12

13 MapReduce Use Case (1) – Map Only
Data distributive tasks – Map only
E.g. classify individual documents
Map does everything
–Input: (docno, doc_content), …
–Output: (docno, [class, class, …]), …
No reduce
© 2010, Le Zhao 13

14 MapReduce Use Case (2) – Filtering and Accumulation
Filtering & accumulation – Map and Reduce
E.g. counting total enrollments of two given classes
Map selects records and outputs initial counts
–In: (Jamie, 11741), (Tom, 11493), …
–Out: (11741, 1), (11493, 1), …
Shuffle/Partition by class_id
Sort
–In: (11741, 1), (11493, 1), (11741, 1), …
–Out: (11493, 1), …, (11741, 1), (11741, 1), …
Reduce accumulates counts
–In: (11493, [1, 1, …]), (11741, [1, 1, …])
–Sum and output: (11493, 16), (11741, 35)
© 2010, Le Zhao 14
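The filter-then-accumulate pattern above can be sketched as plain Python; the class ids and student names come from the slide, and the record values here are made up for the example:

```python
from itertools import groupby

TARGET_CLASSES = {11741, 11493}  # the two classes of interest, from the slide

def map_phase(records):
    # Filter: keep only enrollments in the target classes; emit an initial count of 1
    for student, class_id in records:
        if class_id in TARGET_CLASSES:
            yield (class_id, 1)

def reduce_phase(sorted_pairs):
    # Accumulate: sum the 1s for each class_id (input is sorted, as after shuffle)
    for class_id, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield (class_id, sum(v for _, v in group))

records = [("Jamie", 11741), ("Tom", 11493), ("Ann", 11741), ("Bob", 10601)]
counts = dict(reduce_phase(sorted(map_phase(records))))
# counts == {11493: 1, 11741: 2}; Bob's class 10601 was filtered out by Map
```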

15 MapReduce Use Case (3) – Database Join
Problem: massive lookups
–Given two large lists: (URL, ID) and (URL, doc_content) pairs
–Produce (ID, doc_content)
Solution: database join
Input stream: both (URL, ID) and (URL, doc_content) lists
–(http://del.icio.us/post, 0), (http://digg.com/submit, 1), …
–(http://del.icio.us/post, html0), (http://digg.com/submit, html1), …
Map simply passes input along; Shuffle and Sort on URL (group ID & doc_content for the same URL together)
–Out: (http://del.icio.us/post, 0), (http://del.icio.us/post, html0), (http://digg.com/submit, html1), (http://digg.com/submit, 1), …
Reduce outputs result stream of (ID, doc_content) pairs
–In: (http://del.icio.us/post, [0, html0]), (http://digg.com/submit, [html1, 1]), …
–Out: (0, html0), (1, html1), …
© 2010, Le Zhao 15
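A sketch of this reduce-side join. One detail the slide leaves implicit is how Reduce tells an ID apart from a doc_content; here each record is tagged with its source list (a common convention, assumed rather than taken from the slide):

```python
from itertools import groupby

def map_phase(id_pairs, content_pairs):
    # Map passes records through, tagged with their source list
    for url, doc_id in id_pairs:
        yield (url, ("id", doc_id))
    for url, content in content_pairs:
        yield (url, ("content", content))

def reduce_phase(sorted_pairs):
    # All records for one URL arrive together; pair the ID with the content
    for url, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        values = dict(v for _, v in group)
        yield (values["id"], values["content"])

ids = [("http://del.icio.us/post", 0), ("http://digg.com/submit", 1)]
docs = [("http://digg.com/submit", "html1"), ("http://del.icio.us/post", "html0")]
# sorted() plays the role of Shuffle & Sort on URL
joined = sorted(reduce_phase(sorted(map_phase(ids, docs))))
# joined == [(0, "html0"), (1, "html1")]
```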

16 MapReduce Use Case (4) – Secondary Sort
Problem: sorting on values
E.g. reverse graph edge directions & output in node order
–Input: adjacency list of graph (3 nodes and 4 edges)
»(3, [1, 2]), (1, [2, 3]) → (1, [3]), (2, [1, 3]), (3, [1])
Note, the node_ids in the output values are also sorted. But Hadoop only sorts on keys!
Solution: secondary sort
Map
–In: (3, [1, 2]), (1, [2, 3]).
–Intermediate: (1, [3]), (2, [3]), (2, [1]), (3, [1]). (reverse edge direction)
–Out: ((1, 3), [3]), ((2, 3), [3]), ((2, 1), [1]), ((3, 1), [1]).
–Copy node_ids from value to key.
© 2010, Le Zhao 16

17 MapReduce Use Case (4) – Secondary Sort
Secondary sort (ctd.)
Shuffle on Key.field1, and Sort on whole Key (both fields)
–In: ((1, 3), [3]), ((2, 3), [3]), ((2, 1), [1]), ((3, 1), [1])
–Out: ((1, 3), [3]), ((2, 1), [1]), ((2, 3), [3]), ((3, 1), [1])
Grouping comparator
–Merge according to part of the key (Key.field1 only)
–Out: ((1, 3), [3]), ((2, 1), [1, 3]), ((3, 1), [1]) – this will be the reducer's input
Reduce
–Merge & output: (1, [3]), (2, [1, 3]), (3, [1])
© 2010, Le Zhao 17
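The whole secondary-sort trick can be simulated in a few lines: Python's tuple sort stands in for Hadoop's sort on the composite key, and grouping on the key's first field stands in for the grouping comparator. This is an illustration of the mechanism, not Hadoop's actual API:

```python
from itertools import groupby

def map_phase(adjacency):
    # Reverse each edge, copying the value node_id into a composite key
    for src, outs in adjacency:
        for dst in outs:
            yield ((dst, src), src)   # key = (dst, src), value = src

def reduce_phase(pairs):
    # Sort on the whole composite key (Hadoop's sort), then
    # group on key[0] only (Hadoop's grouping comparator)
    pairs = sorted(pairs)
    for dst, group in groupby(pairs, key=lambda kv: kv[0][0]):
        yield (dst, [v for _, v in group])   # values arrive already sorted

reversed_graph = list(reduce_phase(map_phase([(3, [1, 2]), (1, [2, 3])])))
# reversed_graph == [(1, [3]), (2, [1, 3]), (3, [1])]
```

Because the sort already ordered (dst, src) pairs, each reducer's value list comes out sorted without any in-memory sorting in Reduce.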

18 Using MapReduce to Construct Indexes: Preliminaries
Construction of binary inverted lists
Input: documents: (docid, [term, term, …]), (docid, [term, …]), …
Output: (term, [docid, docid, …])
–E.g., (apple, [1, 23, 49, 127, …])
Binary inverted lists fit on a slide more easily
–Everything also applies to frequency and positional inverted lists
A document id is an internal document id, e.g., a unique integer
–Not an external document id such as a URL
MapReduce elements
–Combiner, Secondary Sort, complex keys, sorting on keys' fields
© 2010, Jamie Callan 18

19 Using MapReduce to Construct Indexes: A Simple Approach
A simple approach to creating binary inverted lists
Each Map task is a document parser
–Input: a stream of documents
–Output: a stream of (term, docid) tuples
»(long, 1) (ago, 1) (and, 1) … (once, 2) (upon, 2) …
Shuffle sorts tuples by key and routes tuples to Reducers
Reducers convert streams of keys into streams of inverted lists
–Input: (long, 1) (long, 127) (long, 49) (long, 23) …
–The reducer sorts the values for a key and builds an inverted list
»Longest inverted list must fit in memory
–Output: (long, [df:492, docids:1, 23, 49, 127, …])
© 2010, Jamie Callan 19
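A toy version of this simple approach, with three tiny documents standing in for a real collection (the docids and output shape follow the slide; the df field is the document frequency):

```python
from itertools import groupby

def map_phase(docs):
    # Parser: emit (term, docid) for each distinct term in a document
    for docid, content in docs:
        for term in set(content.split()):
            yield (term, docid)

def reduce_phase(pairs):
    # Group by term; the reducer sorts docids and builds the inverted list
    by_term = sorted(pairs, key=lambda kv: kv[0])
    for term, group in groupby(by_term, key=lambda kv: kv[0]):
        docids = sorted(d for _, d in group)
        yield (term, {"df": len(docids), "docids": docids})

docs = [(1, "long ago and"), (2, "once upon a long"), (3, "long ago")]
index = dict(reduce_phase(map_phase(docs)))
# index["long"] == {"df": 3, "docids": [1, 2, 3]}
```

The in-reducer `sorted(...)` is exactly the step the slide warns about: the longest inverted list must fit in memory, which motivates the combiner and secondary-sort refinements on the next slides.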

20 Using MapReduce to Construct Indexes: A Simple Approach
A more succinct representation of the previous algorithm
Map: (docid1, content1) → (t1, docid1) (t2, docid1) …
Shuffle by t
Sort by t: (t5, docid1) (t4, docid3) … → (t4, docid3) (t4, docid1) (t5, docid1) …
Reduce: (t4, [docid3, docid1, …]) → (t, ilist)
docid: a unique integer
t: a term, e.g., "apple"
ilist: a complete inverted list
But: a) inefficient, b) docids are sorted in reducers, and c) assumes the ilist of a word fits in memory
© 2010, Jamie Callan 20

21 Using MapReduce to Construct Indexes: Using Combine
Map: (docid1, content1) → (t1, ilist1,1) (t2, ilist2,1) (t3, ilist3,1) …
–Each output inverted list covers just one document
Combine: sort by t, then (t1, [ilist1,2, ilist1,3, ilist1,1, …]) → (t1, ilist1,27)
–Each output inverted list covers a sequence of documents
Shuffle by t
Sort by t: (t4, ilist4,1) (t5, ilist5,3) … → (t4, ilist4,2) (t4, ilist4,4) (t4, ilist4,1) …
Reduce: (t7, [ilist7,2, ilist7,1, ilist7,4, …]) → (t7, ilist_final)
ilisti,j: the j'th inverted list fragment for term i
© 2010, Jamie Callan 21

22 Using MapReduce to Construct Indexes
[Diagram: Documents → Parser/Indexer (Map/Combine processors) → Inverted list fragments → Shuffle/Sort → Merger (Reduce processors) → Inverted lists, partitioned A–F, G–P, Q–Z]
© 2010, Jamie Callan 22

23 Using MapReduce to Construct Partitioned Indexes
Map: (docid1, content1) → ([p, t1], ilist1,1)
Combine to sort and group values: ([p, t1], [ilist1,2, ilist1,3, ilist1,1, …]) → ([p, t1], ilist1,27)
Shuffle by p
Sort values by [p, t]
Reduce: ([p, t7], [ilist7,2, ilist7,1, ilist7,4, …]) → ([p, t7], ilist_final)
p: partition (shard) id
© 2010, Jamie Callan 23

24 Using MapReduce to Construct Indexes: Secondary Sort
So far, we have assumed that Reduce can sort values in memory … but what if there are too many to fit in memory?
Map: (docid1, content1) → ([t1, fd1,1], ilist1,1)
Combine to sort and group values
Shuffle by t
Sort by [t, fd], then group by t (Secondary Sort): ([t7, fd7,2], ilist7,2), ([t7, fd7,1], ilist7,1) … → (t7, [ilist7,1, ilist7,2, …])
Reduce: (t7, [ilist7,1, ilist7,2, …]) → (t7, ilist_final)
Values arrive in order, so Reduce can stream its output
fdi,j is the first docid in ilisti,j
© 2010, Jamie Callan 24

25 Using MapReduce to Construct Indexes: Putting it All Together
Map: (docid1, content1) → ([p, t1, fd1,1], ilist1,1)
Combine to sort and group values: ([p, t1, fd1,1], [ilist1,2, ilist1,3, ilist1,1, …]) → ([p, t1, fd1,27], ilist1,27)
Shuffle by p
Secondary Sort by [(p, t), fd]: ([p, t7], [ilist7,2, ilist7,1, ilist7,4, …]) → ([p, t7], [ilist7,1, ilist7,2, ilist7,4, …])
Reduce: ([p, t7], [ilist7,1, ilist7,2, ilist7,4, …]) → ([p, t7], ilist_final)
© 2010, Jamie Callan 25

26 Using MapReduce to Construct Indexes
[Diagram: Documents → Parser/Indexer (Map/Combine processors) → Inverted list fragments → Shuffle/Sort → Merger (Reduce processors) → Inverted lists, one shard per reducer]
© 2010, Jamie Callan 26

27 PageRank Calculation: Preliminaries
One PageRank iteration:
–Input: (id1, [score1(t), out11, out12, …]), (id2, [score2(t), out21, out22, …]), …
–Output: (id1, [score1(t+1), out11, out12, …]), (id2, [score2(t+1), out21, out22, …]), …
MapReduce elements
–Score distribution and accumulation
–Database join
–Side-effect files
© 2010, Jamie Callan 27

28 PageRank: Score Distribution and Accumulation
Map
–In: (id1, [score1(t), out11, out12, …]), (id2, [score2(t), out21, out22, …]), …
–Out: (out11, score1(t)/n1), (out12, score1(t)/n1), …, (out21, score2(t)/n2), …
Shuffle & Sort by node_id
–In: (id2, score1), (id1, score2), (id1, score1), …
–Out: (id1, score1), (id1, score2), …, (id2, score1), …
Reduce
–In: (id1, [score1, score2, …]), (id2, [score1, …]), …
–Out: (id1, score1(t+1)), (id2, score2(t+1)), …
© 2010, Jamie Callan 28
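One distribute-and-accumulate iteration, sketched on a three-node graph. The slide shows only the raw accumulation; the `(1 - d)/N + d * incoming` update in the reducer is the standard damped PageRank formula (with d = 0.85, matching the dangling-nodes slide) and is an assumption added here:

```python
from itertools import groupby

def map_phase(nodes):
    # Distribute each node's score evenly over its n outlinks
    for node_id, (score, outs) in nodes:
        for out in outs:
            yield (out, score / len(outs))

def reduce_phase(pairs, n_nodes, damping=0.85):
    # Accumulate incoming mass per node, then apply the damped update
    for node_id, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        incoming = sum(v for _, v in group)
        yield (node_id, (1 - damping) / n_nodes + damping * incoming)

# Graph: 1 -> {2, 3}, 2 -> {3}, 3 -> {1}; uniform initial scores
nodes = [(1, (1 / 3, [2, 3])), (2, (1 / 3, [3])), (3, (1 / 3, [1]))]
scores = dict(reduce_phase(map_phase(nodes), n_nodes=3))
# Node 3 receives mass from both 1 and 2, so its score grows;
# total score stays (approximately) 1.0 because the graph has no dangling nodes.
```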

29 PageRank: Database Join to Associate Outlinks with Score
Map
–In & Out: (id1, score1(t+1)), (id2, score2(t+1)), …, (id1, [out11, out12, …]), (id2, [out21, out22, …]), …
Shuffle & Sort by node_id
–Out: (id1, score1(t+1)), (id1, [out11, out12, …]), (id2, [out21, out22, …]), (id2, score2(t+1)), …
Reduce
–In: (id1, [score1(t+1), out11, out12, …]), (id2, [out21, out22, …, score2(t+1)]), …
–Out: (id1, [score1(t+1), out11, out12, …]), (id2, [score2(t+1), out21, out22, …]), …
© 2010, Jamie Callan 29

30 PageRank: Side-Effect Files for Dangling Nodes
Dangling nodes
–Nodes with no outlinks (observed but not crawled URLs)
–Score has no outlet
»Need to distribute it to all graph nodes evenly
Map for dangling nodes:
–In: …, (id3, [score3]), …
–Out: …, ("*", 0.85 × score3), …
Reduce
–In: …, ("*", [score1, score2, …]), …
–Out: …, everything else, …
–Output to side-effect file: ("*", score), fed to the Mapper of the next iteration
© 2010, Jamie Callan 30

31 Outline
Why MapReduce (Hadoop)
MapReduce basics
The MapReduce way of thinking
Manipulating large data
© 2010, Le Zhao 31

32 Manipulating Large Data
Do everything in Hadoop (and HDFS)
–Make sure every step is parallelized!
–Any serial step breaks your design
E.g. storing the URL list for a Web graph
–Each node in the Web graph has an id
–[URL1, URL2, …], using the line number as the id – a bottleneck
–[(id1, URL1), (id2, URL2), …], explicit id
© 2010, Le Zhao 32
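One way to produce explicit ids without any serial counting step is to offset each split's local counter by its split id; this trick, and the `assign_ids` helper and `max_per_split` parameter, are illustrative assumptions, since the slide only says "explicit id":

```python
def assign_ids(split_id, urls, max_per_split=10**9):
    # Each mapper knows only its own split_id and its local record offsets,
    # yet the resulting ids are unique across the whole collection --
    # no global line numbering (a serial bottleneck) is needed.
    return [(split_id * max_per_split + i, url) for i, url in enumerate(urls)]

ids = assign_ids(2, ["http://a", "http://b"])
# ids == [(2000000000, "http://a"), (2000000001, "http://b")]
```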

33 Hadoop-based Tools
For developing in Java, NetBeans plugin
–http://www.hadoopstudio.org/docs.html
Pig Latin, a SQL-like high-level data processing script language
Hive, data warehouse, SQL
Cascading, data processing
Mahout, machine learning algorithms on Hadoop
HBase, distributed data store as a large table
More
–http://hadoop.apache.org/
–http://en.wikipedia.org/wiki/Hadoop
–Many other toolkits: Nutch, Cloud9, Ivory
© 2010, Le Zhao 33

34 Get Your Hands Dirty
Hadoop Virtual Machine
–http://www.cloudera.com/developers/downloads/virtual-machine/
»This runs Hadoop 0.20
–An earlier Hadoop 0.18.0 version is here: http://code.google.com/edu/parallel/tools/hadoopvm/index.html
Amazon EC2
Various other Hadoop clusters around
The NetBeans plugin simulates Hadoop
–The workflow view works on Windows
–Local running & debugging work on MacOS and Linux
–http://www.hadoopstudio.org/docs.html
© 2010, Le Zhao 34

35 Conclusions
Why large scale
MapReduce advantages
Hadoop uses
Use cases
–Map only: for totally distributive computation
–Map+Reduce: for filtering & aggregation
–Database join: for massive dictionary lookups
–Secondary sort: for sorting on values
–Inverted indexing: combiner, complex keys
–PageRank: side-effect files
Large data
© 2010, Jamie Callan 35

36 For More Information
L. A. Barroso, J. Dean, and U. Hölzle. "Web search for a planet: The Google cluster architecture." IEEE Micro, 2003.
J. Dean and S. Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters." Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI 2004), pages 137-150. 2004.
S. Ghemawat, H. Gobioff, and S.-T. Leung. "The Google File System." Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP-03), pages 29-43. 2003.
I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes. Morgan Kaufmann. 1999.
J. Zobel and A. Moffat. "Inverted files for text search engines." ACM Computing Surveys, 38(2). 2006.
http://hadoop.apache.org/common/docs/current/mapred_tutorial.html. "Map/Reduce Tutorial". Fetched January 21, 2010.
Tom White. Hadoop: The Definitive Guide. O'Reilly Media. June 5, 2009.
J. Lin and C. Dyer. Data-Intensive Text Processing with MapReduce, book draft. February 7, 2010.
© 2010, Jamie Callan 36




