
Tutorial for MapReduce (Hadoop) & Large Scale Processing Le Zhao (LTI, SCS, CMU) Database Seminar & Large Scale Seminar 2010-Feb-15 Some slides adapted from IR course lectures by Jamie Callan © 2010, Le Zhao 1

Outline Why MapReduce (Hadoop) MapReduce basics The MapReduce way of thinking Manipulating large data © 2010, Le Zhao 2

Outline Why MapReduce (Hadoop) –Why go large scale –Compared to other parallel computing models –Hadoop related tools MapReduce basics The MapReduce way of thinking Manipulating large data © 2010, Le Zhao 3

Why NOT to do parallel computing Concerns: a parallel system needs to provide: –Data distribution –Computation distribution –Fault tolerance –Job scheduling © 2010, Le Zhao 4

Why MapReduce (Hadoop) Previous parallel computation models –1) scp + ssh »Manual everything –2) network cross-mounted disks + condor/torque »No data distribution, and disk access becomes the bottleneck »Can only partition totally distributive computations »No fault tolerance »Prioritized job scheduling © 2010, Le Zhao 5

Hadoop Parallel batch computation –Data distribution »Hadoop Distributed File System (HDFS) »Like a Linux FS, but with automatic data replication –Computation distribution »Automatic, users only need to specify the number of input splits »Can distribute aggregation computations as well –Fault tolerance »Automatic recovery from failure »Speculative execution (a backup task) –Job scheduling »OK, but still relies on the politeness of users © 2010, Le Zhao 6

How you can use Hadoop Hadoop Streaming –Quick hacking – much like shell scripting »Uses STDIN & STDOUT to carry data »cat file | mapper | sort | reducer > output –Easier to reuse legacy code, in any programming language Hadoop Java API –Build large systems »More data types »More control over Hadoop’s behavior »Easier debugging with Java’s error stacktrace display –NetBeans plugin for Hadoop provides easy programming © 2010, Le Zhao 7
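To make the streaming model concrete, here is a hypothetical word-count job in that style (mapper.py and reducer.py are names of my choosing; whitespace tokenization is a simplification):

    #!/usr/bin/env python
    # mapper.py -- read raw text on STDIN, emit one "word<TAB>1" line per token.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word, 1))

    #!/usr/bin/env python
    # reducer.py -- STDIN arrives sorted by key, so all counts for a word
    # are adjacent and can be summed in a single streaming pass.
    import sys

    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

Locally this is exactly the pipeline on the slide: cat file | ./mapper.py | sort | ./reducer.py > output; on a cluster, Hadoop Streaming supplies the distributed sort between the two scripts.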

Outline Why MapReduce (Hadoop) MapReduce basics The MapReduce way of thinking Manipulating large data © 2010, Le Zhao 8

© 2009, Jamie Callan 9 Map and Reduce MapReduce is a new use of an old idea in Computer Science Map: Apply a function to every object in a list –Each object is independent »Order is unimportant »Maps can be done in parallel –The function produces a result Reduce: Combine the results to produce a final result You may have seen this in a Lisp or functional programming course
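As a minimal reminder of the functional-programming idea in Python (not Hadoop, just the concept):

    # Map: apply a function independently to every element of a list.
    # Reduce: combine the mapped results into a single final result.
    from functools import reduce

    squares = list(map(lambda x: x * x, [1, 2, 3, 4]))   # [1, 4, 9, 16]
    total = reduce(lambda a, b: a + b, squares)          # 30
    print(squares, total)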

© 2010, Jamie Callan 10 MapReduce Input reader –Divide input into splits, assign each split to a Map processor Map –Apply the Map function to each record in the split –Each Map function returns a list of (key, value) pairs Shuffle/Partition and Sort –Shuffle distributes sorting & aggregation to many reducers –All records for key k are directed to the same reduce processor –Sort groups the same keys together, and prepares for aggregation Reduce –Apply the Reduce function to each key –The result of the Reduce function is a list of (key, value) pairs
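A single-process Python sketch of this data flow may help fix the stage boundaries; run_mapreduce is my name for it, and it ignores partitioning across machines and fault tolerance:

    from itertools import groupby
    from operator import itemgetter

    def run_mapreduce(records, map_fn, reduce_fn):
        # Map: each input record yields zero or more (key, value) pairs.
        pairs = [kv for rec in records for kv in map_fn(rec)]
        # Shuffle/Sort: bring all values that share a key together.
        pairs.sort(key=itemgetter(0))
        groups = ((k, [v for _, v in g])
                  for k, g in groupby(pairs, key=itemgetter(0)))
        # Reduce: fold each key's value list into final (key, value) pairs.
        return [out for k, vs in groups for out in reduce_fn(k, vs)]

    if __name__ == "__main__":
        docs = [(1, "a b a"), (2, "b c")]
        print(run_mapreduce(docs,
                            lambda d: ((w, 1) for w in d[1].split()),
                            lambda k, vs: [(k, sum(vs))]))
        # [('a', 2), ('b', 2), ('c', 1)]

Several of the use cases below are illustrated against this same sketch.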

MapReduce in One Picture © 2010, Le Zhao 11 Tom White, Hadoop: The Definitive Guide

Outline Why MapReduce (Hadoop) MapReduce basics The MapReduce way of thinking –Two simple use cases –Two more advanced & useful MapReduce tricks –Two MapReduce applications Manipulating large data © 2010, Le Zhao 12

MapReduce Use Case (1) – Map Only Data distributive tasks – Map Only E.g. classify individual documents Map does everything –Input: (docno, doc_content), … –Output: (docno, [class, class, …]), … No reduce © 2010, Le Zhao 13
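In Hadoop a map-only job is simply one whose number of reduce tasks is set to zero, so map output is written out directly. A hypothetical streaming mapper for the classification example (classify() stands in for a real classifier; input is assumed to be docno<TAB>content):

    #!/usr/bin/env python
    # classify_mapper.py -- map-only: every document is handled independently.
    import sys

    def classify(text):
        # placeholder for a real document classifier
        return ["spam"] if "lottery" in text.lower() else ["ham"]

    for line in sys.stdin:
        docno, content = line.rstrip("\n").split("\t", 1)
        print("%s\t%s" % (docno, ",".join(classify(content))))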

MapReduce Use Case (2) – Filtering and Accumulation Filtering & Accumulation – Map and Reduce E.g. Counting total enrollments of two given classes Map selects records and outputs initial counts –In: (Jamie, 11741), (Tom, 11493), … –Out: (11741, 1), (11493, 1), … Shuffle/Partition by class_id Sort –In: (11741, 1), (11493, 1), (11741, 1), … –Out: (11493, 1), …, (11741, 1), (11741, 1), … Reduce accumulates counts –In: (11493, [1, 1, …]), (11741, [1, 1, …]) –Sum and Output: (11493, 16), (11741, 35) © 2010, Le Zhao 14
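In the run_mapreduce sketch from above, this use case is a few lines (the student names and class ids are made up):

    records = [("Jamie", 11741), ("Tom", 11493), ("Le", 11741)]

    def map_fn(rec):
        student, class_id = rec
        yield (class_id, 1)              # one initial count per enrollment

    def reduce_fn(class_id, counts):
        yield (class_id, sum(counts))    # accumulate the counts

    print(run_mapreduce(records, map_fn, reduce_fn))
    # [(11493, 1), (11741, 2)]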

MapReduce Use Case (3) – Database Join Problem: Massive lookups –Given two large lists: (URL, ID) and (URL, doc_content) pairs –Produce (ID, doc_content) Solution: Database join Input stream: both (URL, ID) and (URL, doc_content) lists –(url0, 0), (url1, 1), … –(url0, html0), (url1, html1), … Map simply passes input along, Shuffle and Sort on URL (group ID & doc_content for the same URL together) –Out: (url0, 0), (url0, html0), (url1, html1), (url1, 1), … Reduce outputs result stream of (ID, doc_content) pairs –In: (url0, [0, html0]), (url1, [html1, 1]), … –Out: (0, html0), (1, html1), … © 2010, Le Zhao 15
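A sketch of this reduce-side join against run_mapreduce; distinguishing the two input streams by value type is a simplification (a real job would tag each record with its source file):

    records = [("url0", 0), ("url1", 1),              # the (URL, ID) list
               ("url0", "html0"), ("url1", "html1")]  # the (URL, content) list

    def map_fn(rec):
        yield rec                        # identity map: key on URL

    def reduce_fn(url, values):
        ids = [v for v in values if isinstance(v, int)]
        docs = [v for v in values if not isinstance(v, int)]
        for i in ids:                    # pair the ID with the content that
            for d in docs:               # met it at the same URL key
                yield (i, d)

    print(run_mapreduce(records, map_fn, reduce_fn))
    # [(0, 'html0'), (1, 'html1')]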

MapReduce Use Case (4) – Secondary Sort Problem: Sorting on values E.g. Reverse graph edge directions & output in node order –Input: adjacency list of graph (3 nodes and 4 edges): (3, [1, 2]), (1, [2, 3]) –Output: (1, [3]), (2, [1, 3]), (3, [1]) Note, the node_ids in the output values are also sorted. But Hadoop only sorts on keys! Solution: Secondary sort Map –In: (3, [1, 2]), (1, [2, 3]) –Intermediate: (1, [3]), (2, [3]), (2, [1]), (3, [1]) (reverse edge direction) –Out: ((1, 3), [3]), ((2, 3), [3]), ((2, 1), [1]), ((3, 1), [1]) –Copy node_ids from value into the key © 2010, Le Zhao 16

MapReduce Use Case (4) – Secondary Sort Secondary Sort (ctd.) Shuffle on Key.field1, and Sort on whole Key (both fields) –In: ((1, 3), [3]), ((2, 3), [3]), ((2, 1), [1]), ((3, 1), [1]) –Out: ((1, 3), [3]), ((2, 1), [1]), ((2, 3), [3]), ((3, 1), [1]) Grouping comparator –Merge according to part of the key (field1 only) –Out: (1, [3]), (2, [1, 3]), (3, [1]) this will be the reducer’s input Reduce –Merge & output: (1, [3]), (2, [1, 3]), (3, [1]) © 2010, Le Zhao 17
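The same trick simulated in a few lines of Python: promoting the value into the key makes the framework's key sort order the values, and the grouping step then looks only at the first key field:

    from itertools import groupby

    edges = [(3, [1, 2]), (1, [2, 3])]

    # Map: reverse each edge, copying the node id from value into the key.
    pairs = [((dst, src), src) for src, outs in edges for dst in outs]

    # Shuffle/Sort on the whole composite key (dst first, then src).
    pairs.sort()

    # Grouping comparator: group on key.field1 only; values arrive sorted.
    result = [(dst, [src for _, src in grp])
              for dst, grp in groupby(pairs, key=lambda kv: kv[0][0])]
    print(result)   # [(1, [3]), (2, [1, 3]), (3, [1])]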

Using MapReduce to Construct Indexes: Preliminaries Construction of binary inverted lists Input: documents: (docid, [term, term, …]), (docid, [term, …]), … Output: (term, [docid, docid, …]) –E.g., (apple, [1, 23, 49, 127, …]) Binary inverted lists fit on a slide more easily Everything also applies to frequency and positional inverted lists A document id is an internal document id, e.g., a unique integer Not an external document id such as a URL MapReduce elements Combiner, Secondary Sort, complex keys, Sorting on keys’ fields © 2010, Jamie Callan 18

Using MapReduce to Construct Indexes: A Simple Approach A simple approach to creating binary inverted lists Each Map task is a document parser –Input: A stream of documents –Output: A stream of (term, docid) tuples »(long, 1) (ago, 1) (and, 1) … (once, 2) (upon, 2) … Shuffle sorts tuples by key and routes tuples to Reducers Reducers convert streams of keys into streams of inverted lists –Input: (long, 1) (long, 127) (long, 49) (long, 23) … –The reducer sorts the values for a key and builds an inverted list »Longest inverted list must fit in memory –Output: (long, [df:492, docids:1, 23, 49, 127, …]) © 2010, Jamie Callan 19
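Reusing the run_mapreduce sketch, the simple approach looks like this (binary lists only, so duplicate terms within a document collapse to one posting):

    def map_fn(doc):
        docid, content = doc
        for term in set(content.split()):    # parse: one (term, docid) per term
            yield (term, docid)

    def reduce_fn(term, docids):
        postings = sorted(docids)            # must fit in memory -- the weakness
        yield (term, {"df": len(postings), "docids": postings})

    docs = [(1, "long ago and far away"), (2, "once upon a time")]
    print(run_mapreduce(docs, map_fn, reduce_fn))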

Using MapReduce to Construct Indexes: A Simple Approach A more succinct representation of the previous algorithm Map: (docid_1, content_1) → (t_1, docid_1) (t_2, docid_1) … Shuffle by t Sort by t: (t_5, docid_1) (t_4, docid_3) … → (t_4, docid_3) (t_4, docid_1) (t_5, docid_1) … Reduce: (t_4, [docid_3 docid_1 …]) → (t, ilist) docid: a unique integer t: a term, e.g., “apple” ilist: a complete inverted list But: a) inefficient, b) docids are sorted in reducers, and c) assumes the ilist of a word fits in memory © 2010, Jamie Callan 20

Using MapReduce to Construct Indexes: Using Combine Map: (docid_1, content_1) → (t_1, ilist_1,1) (t_2, ilist_2,1) (t_3, ilist_3,1) … –Each output inverted list covers just one document Combine: sort by t, then merge: (t_1, [ilist_1,2 ilist_1,3 ilist_1,1 …]) → (t_1, ilist_1,27) –Each output inverted list covers a sequence of documents Shuffle by t Sort by t: (t_4, ilist_4,1) (t_5, ilist_5,3) … → (t_4, ilist_4,2) (t_4, ilist_4,4) (t_4, ilist_4,1) … Reduce: (t_7, [ilist_7,2, ilist_7,1, ilist_7,4, …]) → (t_7, ilist_final) ilist_i,j: the j’th inverted list fragment for term i © 2010, Jamie Callan 21
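A sketch of what the combine step does, assuming each fragment is just a sorted docid list; in real Hadoop the combiner is a reducer-like class run on each map node's output before the shuffle:

    def combine(term, fragments):
        # merge several small per-document fragments into one larger fragment,
        # so far less data crosses the network to the reducers
        merged = []
        for frag in sorted(fragments):       # fragments ordered by first docid
            merged.extend(frag)
        return (term, merged)

    print(combine("apple", [[23], [1], [49]]))   # ('apple', [1, 23, 49])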

© 2010, Jamie Callan 22 [Figure: Using MapReduce to Construct Indexes — Documents flow into parallel Parser/Indexer tasks (the Map/Combine processors), which emit inverted list fragments; Shuffle/Sort routes the fragments to Merger tasks (the Reduce processors), each of which writes the final inverted lists for one term range (A-F, G-P, Q-Z).]

Using MapReduce to Construct Partitioned Indexes Map: (docid_1, content_1) → ([p, t_1], ilist_1,1) Combine to sort and group values: ([p, t_1], [ilist_1,2 ilist_1,3 ilist_1,1 …]) → ([p, t_1], ilist_1,27) Shuffle by p Sort values by [p, t] Reduce: ([p, t_7], [ilist_7,2, ilist_7,1, ilist_7,4, …]) → ([p, t_7], ilist_final) p: partition (shard) id © 2010, Jamie Callan 23

Using MapReduce to Construct Indexes: Secondary Sort So far, we have assumed that Reduce can sort values in memory …but what if there are too many to fit in memory? Map: (docid_1, content_1) → ([t_1, fd_1,1], ilist_1,1) Combine to sort and group values Shuffle by t Sort by [t, fd], then Group by t (Secondary Sort): ([t_7, fd_7,2], ilist_7,2), ([t_7, fd_7,1], ilist_7,1), … → (t_7, [ilist_7,1, ilist_7,2, …]) Reduce: (t_7, [ilist_7,1, ilist_7,2, …]) → (t_7, ilist_final) Values arrive in order, so Reduce can stream its output fd_i,j is the first docid in ilist_i,j © 2010, Jamie Callan 24

Using MapReduce to Construct Indexes: Putting it All Together Map: (docid_1, content_1) → ([p, t_1, fd_1,1], ilist_1,1) Combine to sort and group values: ([p, t_1, fd_1,1], [ilist_1,2 ilist_1,3 ilist_1,1 …]) → ([p, t_1, fd_1,27], ilist_1,27) Shuffle by p Secondary Sort by [(p, t), fd]: ([p, t_7], [ilist_7,2, ilist_7,1, ilist_7,4, …]) → ([p, t_7], [ilist_7,1, ilist_7,2, ilist_7,4, …]) Reduce: ([p, t_7], [ilist_7,1, ilist_7,2, ilist_7,4, …]) → ([p, t_7], ilist_final) © 2010, Jamie Callan 25

© 2010, Jamie Callan 26 [Figure: Using MapReduce to Construct Indexes — the same pipeline as before, but each Merger (Reduce processor) now writes one complete index shard: Documents → Parser/Indexer (Map/Combine processors) → inverted list fragments → Shuffle/Sort → Merger → one shard per partition.]

PageRank Calculation: Preliminaries One PageRank iteration: Input: –(id_1, [score_1(t), out_11, out_12, …]), (id_2, [score_2(t), out_21, out_22, …]), … Output: –(id_1, [score_1(t+1), out_11, out_12, …]), (id_2, [score_2(t+1), out_21, out_22, …]), … MapReduce elements Score distribution and accumulation Database join Side-effect files © 2010, Jamie Callan 27

PageRank: Score Distribution and Accumulation Map –In: (id_1, [score_1(t), out_11, out_12, …]), (id_2, [score_2(t), out_21, out_22, …]), … –Out: (out_11, score_1(t)/n_1), (out_12, score_1(t)/n_1), …, (out_21, score_2(t)/n_2), … (n_i is the number of outlinks of id_i) Shuffle & Sort by node_id –In: (id_2, score_1), (id_1, score_2), (id_1, score_1), … –Out: (id_1, score_1), (id_1, score_2), …, (id_2, score_1), … Reduce –In: (id_1, [score_1, score_2, …]), (id_2, [score_1, …]), … –Out: (id_1, score_1(t+1)), (id_2, score_2(t+1)), … © 2010, Jamie Callan 28
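One distribution-and-accumulation pass expressed against the run_mapreduce sketch (a toy 3-node graph; damping, the join back to outlinks, and dangling nodes are handled on the next slides):

    graph = {1: (1.0, [2, 3]), 2: (1.0, [1]), 3: (1.0, [1, 2])}
    records = [(nid, [score] + outs) for nid, (score, outs) in graph.items()]

    def map_fn(rec):
        nid, payload = rec
        score, *outs = payload
        for out in outs:
            yield (out, score / len(outs))   # distribute score over outlinks

    def reduce_fn(nid, scores):
        yield (nid, sum(scores))             # accumulate incoming score

    print(run_mapreduce(records, map_fn, reduce_fn))
    # [(1, 1.5), (2, 1.0), (3, 0.5)]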

PageRank: Database Join to Associate Outlinks with Score Map –In & Out: (id_1, score_1(t+1)), (id_2, score_2(t+1)), …, (id_1, [out_11, out_12, …]), (id_2, [out_21, out_22, …]), … Shuffle & Sort by node_id –Out: (id_1, score_1(t+1)), (id_1, [out_11, out_12, …]), (id_2, [out_21, out_22, …]), (id_2, score_2(t+1)), … Reduce –In: (id_1, [score_1(t+1), out_11, out_12, …]), (id_2, [out_21, out_22, …, score_2(t+1)]), … –Out: (id_1, [score_1(t+1), out_11, out_12, …]), (id_2, [score_2(t+1), out_21, out_22, …]), … © 2010, Jamie Callan 29

PageRank: Side-Effect Files for Dangling Nodes Dangling nodes –Nodes with no outlinks (observed but not crawled URLs) –Their score has no outlet »need to distribute it to all graph nodes evenly Map for dangling nodes: –In: …, (id_3, [score_3]), … –Out: …, ("*", 0.85×score_3), … Reduce –In: …, ("*", [score_1, score_2, …]), … –Out: …, everything else, … –Output to a side-effect file: ("*", score), fed to the Mappers of the next iteration © 2010, Jamie Callan 30
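A rough sketch of the dangling-node trick in the same style (string node ids so the special "*" key sorts alongside them; 0.85 is the usual damping factor):

    def map_fn(rec):
        nid, payload = rec
        score, *outs = payload
        if not outs:                       # dangling node: score has no outlet
            yield ("*", 0.85 * score)      # route its damped mass to "*"
        else:
            for out in outs:
                yield (out, score / len(outs))

    def reduce_fn(key, scores):
        # the "*" total would be written to a side-effect file and fed to
        # every mapper at the start of the next iteration
        yield (key, sum(scores))

    print(run_mapreduce([("n1", [1.0, "n2"]), ("n2", [1.0])], map_fn, reduce_fn))
    # [('*', 0.85), ('n2', 1.0)]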

Outline Why MapReduce (Hadoop) MapReduce basics The MapReduce way of thinking Manipulating large data © 2010, Le Zhao 31

Manipulating Large Data Do everything in Hadoop (and HDFS) –Make sure every step is parallelized! –Any serial step breaks your design E.g. storing the URL list for a Web graph –Each node in the Web graph has an id –[URL_1, URL_2, …], using line numbers as ids – a serial bottleneck –[(id_1, URL_1), (id_2, URL_2), …], explicit ids (one way to assign them is sketched below) © 2010, Le Zhao 32
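One hypothetical way to get explicit ids without any serial numbering step is to derive the id from the URL itself, so every mapper computes the same id independently and in parallel:

    import hashlib

    def url_id(url):
        # stable 64-bit id hashed from the URL; collisions are possible but
        # rare, and a real system would detect and resolve them
        return int.from_bytes(hashlib.sha1(url.encode("utf-8")).digest()[:8], "big")

    print(url_id("http://example.com/"))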

Hadoop based Tools For developing in Java: the NetBeans plugin for Hadoop Pig Latin, a SQL-like high level data processing script language Hive, data warehouse, SQL Cascading, data processing Mahout, machine learning algorithms on Hadoop HBase, distributed data store as a large table More –Many other toolkits: Nutch, Cloud9, Ivory © 2010, Le Zhao 33

Get Your Hands Dirty Hadoop Virtual Machine »Runs Hadoop 0.20 »An earlier Hadoop version is also available Amazon EC2 Various other Hadoop clusters around The NetBeans plugin simulates Hadoop –The workflow view works on Windows –Local running & debugging works on MacOS and Linux © 2010, Le Zhao 34

Conclusions Why large scale MapReduce advantages Hadoop uses Use cases –Map only: for totally distributive computation –Map+Reduce: for filtering & aggregation –Database join: for massive dictionary lookups –Secondary sort: for sorting on values –Inverted indexing: combiner, complex keys –PageRank: side effect files Large data © 2010, Jamie Callan 35

© 2010, Jamie Callan 36 For More Information L. A. Barroso, J. Dean, and U. Hölzle. “Web search for a planet: The Google cluster architecture.” IEEE Micro, 23(2), 2003. J. Dean and S. Ghemawat. “MapReduce: Simplified Data Processing on Large Clusters.” Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI 2004), pages 137–150. S. Ghemawat, H. Gobioff, and S.-T. Leung. “The Google File System.” Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP-03), pages 29–43. I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes. Morgan Kaufmann, 1999. J. Zobel and A. Moffat. “Inverted files for text search engines.” ACM Computing Surveys, 38(2), 2006. “Map/Reduce Tutorial.” Fetched January 21, 2010. Tom White. Hadoop: The Definitive Guide. O’Reilly Media, June 5, 2009. J. Lin and C. Dyer. Data-Intensive Text Processing with MapReduce. Book draft, February 7, 2010.