1 CS347: Map-Reduce & Pig
Hector Garcia-Molina, Stanford University

"Big Data" Open Source Systems Infrastructure for distributed data computations –Map-Reduce, S4, Hyracks, Pregel [Storm, Mupet] Components –MemCachedD, ZooKeeper, Kestrel Data services –Pig, F1 Cassandra, H-Base, Big Table [Hive] CS347Notes 09 2

3 Motivation for Map-Reduce
Recall one of our sort strategies: [diagram: fragments R1, R2, R3 are partitioned on key ranges k0, k1, locally sorted into R'1, R'2, R'3, and combined into the Result] – i.e., "process data & partition" followed by "additional processing".

4 Another example: Asymmetric fragment + replicate join
[diagram: R (fragments Ra, Rb) is repartitioned by f into R1, R2, R3, S (fragments Sa, Sb) is replicated to every node; local joins are unioned into the Result] – again "process data & partition" followed by "additional processing".

5 Building Text Index - Part I
[diagram] A page stream is loaded from disk and tokenized into (word, page) pairs, e.g. (rat, 1), (dog, 1), (dog, 2), (cat, 2), (rat, 3), (dog, 3); the pairs are sorted and flushed to disk as intermediate runs, e.g. (cat, 2), (dog, 1), (dog, 2), (dog, 3), (rat, 1), (rat, 3).
This was the original Map-Reduce application.

6 Building Text Index - Part II
[diagram] Intermediate runs, e.g. (cat, 2), (dog, 1), (dog, 2), (dog, 3), (rat, 1), (rat, 3) and (ant, 5), (cat, 4), (dog, 4), (dog, 5), (eel, 6), are merged into the sorted list (ant, 5), (cat, 2), (cat, 4), (dog, 1), (dog, 2), (dog, 3), (dog, 4), (dog, 5), (eel, 6), (rat, 1), (rat, 3), which yields the final index: (ant: 5), (cat: 2,4), (dog: 1,2,3,4,5), (eel: 6), (rat: 1,3).

7 Generalizing: Map-Reduce
[same diagram as slide 5: pages are tokenized into (word, page) pairs, sorted, and flushed as intermediate runs; this phase is labeled Map]

8 Generalizing: Map-Reduce
[same diagram as slide 6: intermediate runs are merged into the final index; this phase is labeled Reduce]

9 Map Reduce
Input: R = {r1, r2, ..., rn}, functions M, R
– M(ri) → { [k1, v1], [k2, v2], ... }
– R(ki, valSet) → [ki, valSet']
Let S = { [k, v] | [k, v] ∈ M(r) for some r ∈ R }   (S is a bag)
Let K = { k | [k, v] ∈ S, for any v }
Let G(k) = { v | [k, v] ∈ S }   (G is a bag)
Output = { [k, T] | k ∈ K, T = R(k, G(k)) }
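To make the semantics concrete, here is a minimal sequential Python sketch; the names run_mapreduce, M, R are illustrative, and this is not the Google implementation:

```python
from collections import defaultdict

def run_mapreduce(records, M, R):
    """Sequential sketch of the semantics above.

    M(r) yields [k, v] pairs (their union over all records is the bag S);
    R(k, values) is applied once per distinct key k with G(k) as its values.
    """
    groups = defaultdict(list)          # k -> G(k)
    for r in records:
        for k, v in M(r):               # build the bag S, grouped by key
            groups[k].append(v)
    # Output = { [k, T] | k in K, T = R(k, G(k)) }
    return [R(k, vals) for k, vals in groups.items()]
```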

10 References
MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat.
Pig Latin: A Not-So-Foreign Language for Data Processing, Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins.

11 Example: Counting Word Occurrences
map(String doc, String value):
  // doc is the document name
  // value is the document content
  for each word w in value:
    EmitIntermediate(w, "1");
Example: map(doc, "cat dog cat bat dog") emits [cat 1], [dog 1], [cat 1], [bat 1], [dog 1]
Why does map have 2 parameters?

12 Example: Counting Word Occurrences
reduce(String key, Iterator values):
  // key is a word
  // values is a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
Example: reduce("dog", ["1", "1", "1", "1"]) emits "4" – shouldn't it emit ("dog", 4)?
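Wiring the two slide functions into the run_mapreduce sketch from slide 9 gives a runnable word count; the tuple-based record format and the function names wc_map, wc_reduce are illustrative:

```python
def wc_map(record):
    doc, value = record                     # (document name, document content)
    for w in value.split():
        yield (w, "1")                      # EmitIntermediate(w, "1")

def wc_reduce(key, values):
    return (key, sum(int(v) for v in values))   # emit the key together with its count

print(run_mapreduce([("doc", "cat dog cat bat dog")], wc_map, wc_reduce))
# [('cat', 2), ('dog', 2), ('bat', 1)]
```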

13 Google MR Overview
[architecture diagram]

14 Implementation Issues
– Combine function
– File system
– Partition of input, keys
– Failures
– Backup tasks
– Ordering of results

15 Combine Function
Combine is like a local reduce applied before distribution: a map worker that produces [cat 1], [cat 1], [cat 1], ..., [dog 1], [dog 1], ... sends on [cat 3], ..., [dog 2], ... instead.
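A hedged sketch of such a combiner for word counting (the name combine is mine): it performs the reducer's aggregation locally on one map worker's output before anything is shuffled over the network.

```python
from collections import defaultdict

def combine(map_output):
    """Local reduce on one map worker's output before the shuffle:
    [cat 1], [cat 1], [cat 1], [dog 1], [dog 1] becomes [cat 3], [dog 2]."""
    partial = defaultdict(int)
    for k, v in map_output:
        partial[k] += int(v)
    return [(k, str(count)) for k, count in partial.items()]
```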

16 Distributed File System
– A map worker must be able to access any part of the input file.
– A reduce worker must be able to access the local disks of the map workers.
– Any worker must be able to write its part of the answer; the answer is left as a distributed file.
– All data transfers are through the distributed file system.

17 Partition of Input, Keys
How many workers, and how many partitions (splits) of the input file? [diagram]
– Best to have many splits per worker: this improves load balance, and if a worker fails it is easier to spread its tasks over the others.
– Should workers be assigned to splits "near" them?
– Similar questions arise for the reduce workers.
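For the key side, the default partitioning function in the MapReduce paper is hash(key) mod R, where R is the number of reduce tasks. The sketch below uses crc32 purely as an example of a deterministic hash (my choice, not necessarily Google's), since every map worker must agree on where a key goes.

```python
import zlib

def partition(key, R):
    """Map an intermediate key to one of R reduce partitions.

    Every occurrence of a key must land in the same partition, so the
    hash has to be deterministic across all map workers.
    """
    return zlib.crc32(str(key).encode("utf-8")) % R
```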

18 Failures
The distributed implementation should produce the same output as would have been produced by a non-faulty sequential execution of the program.
General strategy: the master detects worker failures (polling the worker: "ok?") and has the failed work, e.g. split j, re-done by another worker ("redo j").

19 Backup Tasks
A straggler is a machine that takes unusually long (e.g., because of a bad disk) to finish its work; a straggler can delay final completion.
When the job is close to finishing, the master schedules backup executions of the remaining tasks; it must be able to eliminate the redundant results.

20 Ordering of Results
The final result (at each node) is in key order: [k1, T1] [k2, T2] [k3, T3] [k4, T4]. The intermediate pairs handed to a reduce worker, e.g. [k1, v1] [k3, v3], are also in key order.

21 Example: Sorting Records
[diagram: map workers W1, W2, W3 feed reduce workers W5, W6]
Map: extract the sort key k and output [k, record]. Reduce: do nothing!
Question: with two records for k = 6, do we get one output tuple or two?
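A sketch of this job in terms of the run_mapreduce sketch from slide 9 (the record layout is illustrative): map extracts the key, reduce is the identity, so two records with k = 6 come back inside a single output tuple.

```python
def sort_map(record):
    yield (record["k"], record)   # extract the sort key, keep the whole record as the value

def sort_reduce(key, records):
    return (key, records)         # "Do nothing!": all records sharing a key end up in one output tuple

rows = [{"k": 6, "val": "x"}, {"k": 2, "val": "y"}, {"k": 6, "val": "z"}]
print(sorted(run_mapreduce(rows, sort_map, sort_reduce)))
# [(2, [{'k': 2, 'val': 'y'}]), (6, [{'k': 6, 'val': 'x'}, {'k': 6, 'val': 'z'}])]
```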

22 Other Issues
– Skipping bad records
– Debugging

23 MR Claimed Advantages
– The model is easy to use and hides the details of parallelization and fault recovery.
– Many problems are expressible in the MR framework.
– Scales to thousands of machines.

24 MR Possible Disadvantages
– The rigid 1-input, 2-stage data flow is hard to adapt to other scenarios.
– Custom code must be written even for the most common operations, e.g., projection and filtering.
– The opaque nature of the map and reduce functions impedes optimization.

25 Questions
– Can MR be made more "declarative"?
– How can we perform joins?
– How can we perform approximate grouping? Example: for all keys that are similar, reduce all of the values for those keys together.

26 Additional Topics
– Hadoop: an open-source Map-Reduce system
– Pig: a Yahoo system that builds on MR but is more declarative

27 Pig & Pig Latin A layer on top of map-reduce (Hadoop) –Pig is the system –Pig Latin is the query language Pig Latin is a hybrid between: –high-level declarative query language in the spirit of SQL –low-level, procedural programming à la map- reduce. CS347Notes 09

28 Example
Table urls: (url, category, pagerank)
Find, for each sufficiently large category, the average pagerank of high-pagerank urls in that category. In SQL:
SELECT category, AVG(pagerank)
FROM urls
WHERE pagerank > 0.2
GROUP BY category
HAVING COUNT(*) > 10^6

29 Example in Pig Latin
SQL: SELECT category, AVG(pagerank) FROM urls WHERE pagerank > 0.2 GROUP BY category HAVING COUNT(*) > 10^6
In Pig Latin:
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);

30 good_urls = FILTER urls BY pagerank > 0.2;
Input urls: (url, category, pagerank) → output good_urls: (url, category, pagerank)

31 groups = GROUP good_urls BY category;
Input good_urls: (url, category, pagerank) → output groups: (category, good_urls)

32 big_groups = FILTER groups BY COUNT(good_urls) > 1;
Input groups: (category, good_urls) → output big_groups: (category, good_urls)

33 output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
Input big_groups: (category, good_urls) → output: (category, AVG(good_urls.pagerank))

34 Features
– Writing a program is similar to specifying a query execution plan (i.e., a dataflow graph), making it easier for programmers to understand and control how their data processing task is executed.
– Support for a flexible, fully nested data model.
– Extensive support for user-defined functions.
– Ability to operate over plain input files without any schema information.
– A novel debugging environment, useful when dealing with enormous data sets.

35 Execution Control: Good or Bad?
Example:
spam_urls = FILTER urls BY isSpam(url);
culprit_urls = FILTER spam_urls BY pagerank > 0.8;
Should the system re-order the filters?

36 User Defined Functions
Example:
groups = GROUP urls BY category;
output = FOREACH groups GENERATE category, top10(urls);
The UDF top10 can return a scalar or a set.
[table: input groups such as (.gov, {(x.fbi.gov, .gov, 0.7) ...}), (.edu, {(y.yale.edu, .edu, 0.5) ...}), (.com, {(z.cnn.com, .com, 0.9) ...}) map to outputs such as (.gov, {(fbi.gov) (cia.gov) ...}), (.edu, {(yale.edu) ...}), (.com, {(cnn.com) (ibm.com) ...})]
Should the argument be groups.url?
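Pig UDFs of that era were written in Java; purely to illustrate what top10 might compute, here is a hedged Python sketch assuming each element of the bag is a (url, category, pagerank) tuple as in the running example:

```python
def top10(url_bag):
    """Illustrative UDF: return the 10 tuples with the highest pagerank
    (field 2) from a bag of (url, category, pagerank) tuples. The real UDF
    could equally return just the urls (a set) or a single scalar."""
    return sorted(url_bag, key=lambda t: t[2], reverse=True)[:10]
```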

37 Data Model
– Atom, e.g., 'alice'
– Tuple, e.g., ('alice', 'lakers')
– Bag, e.g., { ('alice', 'lakers') ('alice', ('iPod', 'apple')) }
– Map, e.g., [ 'fan of' → { ('lakers') ('iPod') }, 'age' → 20 ]
Note: Bags can currently only hold tuples, so {1, 2, 3} is stored as {(1) (2) (3)}.
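A rough Python rendering of the four types (this mapping is mine, not Pig's; Python lists stand in for bags since bags allow duplicates):

```python
atom = "alice"                                      # Atom: a simple atomic value
tup = ("alice", "lakers")                           # Tuple: fields may themselves be nested
bag = [("alice", "lakers"),
       ("alice", ("iPod", "apple"))]                # Bag: a collection of tuples
mp = {"fan of": [("lakers",), ("iPod",)],           # Map: keys to values; a value can be
      "age": 20}                                    #      a bag of tuples or an atom
```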

38 Expressions in Pig Latin
[table of Pig Latin expression types; contents not recovered. Annotations: see the flatten examples ahead; one entry should be (1) + (2).]

39 Specifying Input Data
queries = LOAD 'query_log.txt' USING myLoad() AS (userId, queryString, timestamp);
Here 'query_log.txt' is the input file, myLoad() a custom deserializer, (userId, queryString, timestamp) the output schema, and queries a handle for future use.

40 For Each
expanded_queries = FOREACH queries GENERATE userId, expandQuery(queryString);
(See the example on the next slide.) Note that each tuple is processed independently, which is good for parallelism.
To remove one level of nesting:
expanded_queries = FOREACH queries GENERATE userId, FLATTEN(expandQuery(queryString));

41 ForEach and Flattening
[example figure showing expandQuery output with and without FLATTEN, plus userId; note that "lakers rumors" is a single string value]

42 Flattening Example (Fill In)
[table X with columns A, B, C; contents not recovered]
Y = FOREACH X GENERATE A, FLATTEN(B), C

43 Flattening Example (Fill In)
Y  = FOREACH X GENERATE A, FLATTEN(B), C
Z  = FOREACH Y GENERATE A, B, FLATTEN(C)
Z' = FOREACH X GENERATE A, FLATTEN(B), FLATTEN(C)
Is Z = Z'?

44 Flattening Example
[tables X and Y; contents mostly not recovered]
Y = FOREACH X GENERATE A, FLATTEN(B), C
Flatten is not recursive: the first result tuple is (a1, b1, b2, {(c1)(c2)}).
Note that attribute naming gets complicated: $2 of the first tuple is b2, but for the third tuple it is {(c1)(c2)}.

45 Flattening Example
Y  = FOREACH X GENERATE A, FLATTEN(B), C
Z  = FOREACH Y GENERATE A, B, FLATTEN(C)
Z' = FOREACH X GENERATE A, FLATTEN(B), FLATTEN(C)
Note that Z = Z' [result tables not recovered].
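A hedged sketch of the rule these examples illustrate (the function name flatten_field and the sample row are mine, built from slide 44's stated first result tuple): FLATTEN of a bag-valued field splices each tuple of the bag into the output row, repeating the other fields, and does not descend into other bags, which is why flatten is not recursive.

```python
def flatten_field(row, i):
    """FOREACH ... GENERATE with FLATTEN applied to the bag in position i.

    One input tuple whose i-th field is a bag of tuples becomes one output
    tuple per element of that bag; the other fields are simply repeated.
    """
    out = []
    for element in row[i]:                          # element is itself a tuple
        out.append(row[:i] + element + row[i + 1:])
    return out

# Example matching slide 44: B = {(b1, b2)}, C = {(c1) (c2)}
row = ("a1", [("b1", "b2")], [("c1",), ("c2",)])
print(flatten_field(row, 1))
# [('a1', 'b1', 'b2', [('c1',), ('c2',)])]   i.e. (a1, b1, b2, {(c1)(c2)})
```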

46 Filter
real_queries = FILTER queries BY userId neq 'bot';
real_queries = FILTER queries BY NOT isBot(userId);   (isBot is a UDF)

47 Co-Group
Two data sets, for example:
– results: (queryString, url, position)
– revenue: (queryString, adSlot, amount)
grouped_data = COGROUP results BY queryString, revenue BY queryString;
url_revenues = FOREACH grouped_data GENERATE FLATTEN(distributeRevenue(results, revenue));
Co-Group is more flexible than SQL JOIN.

48 CoGroup vs Join
[figure contrasting COGROUP output (one tuple per key, holding a bag from each input) with JOIN output (the flattened cross-product per key); contents not recovered]
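A minimal Python sketch of the contrast (the function name cogroup and the positional key arguments are illustrative): COGROUP keeps the matching tuples from each input as separate bags per key, instead of flattening them as JOIN does.

```python
from collections import defaultdict

def cogroup(results, revenue, key_r=0, key_v=0):
    """COGROUP results BY queryString, revenue BY queryString (sketch).
    Each output tuple holds the group key plus one bag per input relation."""
    groups = defaultdict(lambda: ([], []))
    for t in results:
        groups[t[key_r]][0].append(t)
    for t in revenue:
        groups[t[key_v]][1].append(t)
    return [(k, bags[0], bags[1]) for k, bags in groups.items()]

print(cogroup([("lakers", "nba.com", 1), ("lakers", "espn.com", 2)],
              [("lakers", "top", 50)]))
# [('lakers', [('lakers', 'nba.com', 1), ('lakers', 'espn.com', 2)], [('lakers', 'top', 50)])]
```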

49 Group (Simple CoGroup)
grouped_revenue = GROUP revenue BY queryString;
query_revenues = FOREACH grouped_revenue GENERATE queryString, SUM(revenue.amount) AS totalRevenue;

50 CoGroup Example 1
[tables X with columns A, B, C and Y with columns A, B, D; contents not recovered]
Z1 = GROUP X BY A   (result Z1 has columns A, X)

51 CoGroup Example 1
Z1 = GROUP X BY A; the result Z1 has columns A and X, where X is the bag of X tuples with that A value [result table not recovered]

52 CoGroup Example 2
[tables X (A, B, C) and Y (A, B, D); contents not recovered]
Z2 = GROUP X BY (A, B)   (what does the result look like?)
(This syntax is not in the paper but is being added.)

53 CoGroup Example 2
Z2 = GROUP X BY (A, B); the result has a group column for the combined key (A/B?) plus the bag X [result table not recovered]
(This syntax is not in the paper but is being added.)

54 CoGroup Example 3
[tables X (A, B, C) and Y (A, B, D); contents not recovered]
Z3 = COGROUP X BY A, Y BY A   (result Z3 has columns A, X, Y)

55 CoGroup Example 3
Z3 = COGROUP X BY A, Y BY A; each result tuple holds an A value plus a bag of matching X tuples and a bag of matching Y tuples [result table not recovered]

56 CoGroup Example 4
[tables X (A, B, C) and Y (A, B, D); contents not recovered]
Z4 = COGROUP X BY A, Y BY B   (result has columns for the group key, X, and Y)

57 CoGroup Example 4
Z4 = COGROUP X BY A, Y BY B; X tuples are grouped on A and Y tuples on B, and bags that share the same key value appear in the same result tuple [result table not recovered]

58 CoGroup With Function Call?
[table X with columns A, B; contents not recovered]
Y = GROUP X BY A
Z = GROUP X BY SUM(A)   (SUM adds the integers in the tuple)
What do Y and Z look like?

59 CoGroup With Function Call?
Y = GROUP X BY A gives groups (A, X); Z = GROUP X BY SUM(A) gives groups keyed by SUM(A) (or A?) plus the bag X [result tables not recovered]

60 Pig Latin Join
join_result = JOIN results BY queryString, revenue BY queryString;
Shorthand for:
temp_var = COGROUP results BY queryString, revenue BY queryString;
join_result = FOREACH temp_var GENERATE FLATTEN(results), FLATTEN(revenue);
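The equivalence the slide states can be written out against the cogroup sketch from slide 48; this is again a hedged Python sketch, flattening both bags of each cogroup tuple, i.e., emitting the per-key cross-product.

```python
from itertools import product

def join(results, revenue, key_r=0, key_v=0):
    """JOIN results BY queryString, revenue BY queryString (sketch):
    COGROUP followed by FLATTEN of both bags, a cross-product per key."""
    out = []
    for key, bag_r, bag_v in cogroup(results, revenue, key_r, key_v):
        for r, v in product(bag_r, bag_v):
            out.append(r + v)               # concatenate the matching tuples
    return out
```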

61 MapReduce in Pig Latin
map_result = FOREACH input GENERATE FLATTEN(map(*));
key_groups = GROUP map_result BY $0;
output = FOREACH key_groups GENERATE reduce(*);
Here * passes all attributes, and $0 is the first attribute, i.e., the key.

62 Store
To materialize a result in a file:
STORE query_revenues INTO 'myoutput' USING myStore();
Here 'myoutput' is the output file and myStore() a custom serializer.

63 Hadoop
– HDFS: the Hadoop file system
– How to use Hadoop, with examples