1 CS347: Map-Reduce & Pig
Hector Garcia-Molina, Stanford University

2 "Big Data" Open Source Systems
- Infrastructure for distributed data computations: Map-Reduce, S4, Hyracks, Pregel [Storm, Mupet]
- Components: MemCachedD, ZooKeeper, Kestrel
- Data services: Pig, F1; Cassandra, H-Base, Big Table [Hive]

3 Motivation for Map-Reduce
Recall one of our sort strategies:
[Figure: each fragment R1, R2, R3 is sorted locally into R'1, R'2, R'3, the sorted streams are partitioned on key boundaries k0, k1, and the pieces are combined into the final result. Two phases: process data & partition, then additional processing.]

4 Another example: Asymmetric fragment + replicate join
[Figure: R (fragments Ra, Rb) is repartitioned by a function f into R1, R2, R3; S (fragments Sa, Sb) is replicated to every join site; each site computes a local join, and the union of the local results is the final result. Again two phases: process data & partition, then additional processing.]

5 Building Text Index - Part I
original Map-Reduce application: building an inverted text index.
[Figure: a stream of pages (page 1: rat, dog; page 2: dog, cat; page 3: dog, rat) is loaded, tokenized into (word, page) pairs in a memory buffer, sorted, and flushed to disk as sorted intermediate runs such as (cat, 2) (dog, 1) (dog, 2) (dog, 3) (rat, 1) (rat, 3). Stages: Loading, Tokenizing, Sorting.]
Different resources (network, CPU, I/O) are used by the different stages, so the obvious approach is to execute the stages simultaneously.

6 Building Text Index - Part II
[Figure: the sorted intermediate runs, e.g. (cat, 2) (dog, 1) (dog, 2) (dog, 3) (rat, 1) (rat, 3) and (ant, 5) (cat, 4) (dog, 4) (dog, 5) (eel, 6), are merged into the final index: (ant: 5) (cat: 2,4) (dog: 1,2,3,4,5) (eel: 6) (rat: 1,3).]
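For concreteness, here is a short Python sketch of this merge step, using the run contents from the figure (heapq.merge performs the k-way merge of the sorted runs):
import heapq
from itertools import groupby
run1 = [("cat", 2), ("dog", 1), ("dog", 2), ("dog", 3), ("rat", 1), ("rat", 3)]
run2 = [("ant", 5), ("cat", 4), ("dog", 4), ("dog", 5), ("eel", 6)]
merged = heapq.merge(run1, run2)                  # k-way merge of sorted runs
index = {word: [page for _, page in postings]     # collapse equal words into posting lists
         for word, postings in groupby(merged, key=lambda p: p[0])}
# -> {'ant': [5], 'cat': [2, 4], 'dog': [1, 2, 3, 4, 5], 'eel': [6], 'rat': [1, 3]}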

7 Generalizing: Map-Reduce
[Same figure as Part I of the text-index example (slide 5): pages are loaded, tokenized into (word, page) pairs, sorted, and flushed to disk as intermediate runs.]

8 Generalizing: Map-Reduce
[Same figure as Part II of the text-index example (slide 6): the intermediate runs are merged into the final index. The merge step is labeled Reduce.]

9 Map Reduce
Input: R = {r1, r2, ..., rn}, functions M and R
M(ri) → { [k1, v1], [k2, v2], ... }
R(ki, valSet) → [ki, valSet']
Let S = { [k, v] | [k, v] ∈ M(r) for some r ∈ R }
Let K = { k | [k, v] ∈ S, for any v }
Let G(k) = { v | [k, v] ∈ S }
Output = { [k, T] | k ∈ K, T = R(k, G(k)) }
Note: S and G(k) are bags (duplicates are kept).
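To make the definition concrete, here is a minimal sequential Python sketch of these semantics (the names run_mapreduce, map_fn, and reduce_fn are mine, purely for illustration):
from collections import defaultdict
def run_mapreduce(records, map_fn, reduce_fn):
    # Build S (all emitted [k, v] pairs) and G (key -> bag of values) in one pass;
    # lists serve as bags, so duplicate values are kept.
    groups = defaultdict(list)
    for r in records:
        for k, v in map_fn(r):
            groups[k].append(v)
    # Output: one [k, T] per distinct key k, with T = reduce_fn(k, G(k)).
    return {k: reduce_fn(k, vals) for k, vals in groups.items()}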

10 References
- MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat (OSDI 2004).
- Pig Latin: A Not-So-Foreign Language for Data Processing, Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins (SIGMOD 2008).

11 Example: Counting Word Occurrences
map(String doc, String value):
  // doc is the document name
  // value is the document content
  for each word w in value:
    EmitIntermediate(w, "1");
Example: map(doc, "cat dog cat bat dog") emits [cat 1], [dog 1], [cat 1], [bat 1], [dog 1]
Why does map have 2 parameters?

12 Example: Counting Word Occurrences
reduce(String key, Iterator values):
  // key is a word
  // values is a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
Example: reduce("dog", ["1", "1", "1", "1"]) emits "4"
Should it emit ("dog", 4)??
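Putting the two slides together, a self-contained Python sketch of the word count (the driver loop below stands in for the framework's grouping step; the document names d1, d2 are made up):
from collections import defaultdict
def wc_map(doc_name, contents):
    return [(w, "1") for w in contents.split()]         # EmitIntermediate(w, "1")
def wc_reduce(word, counts):
    return str(sum(int(c) for c in counts))             # add up the "1"s, emit as a string
groups = defaultdict(list)
for name, text in [("d1", "cat dog cat bat dog"), ("d2", "dog dog")]:
    for k, v in wc_map(name, text):
        groups[k].append(v)
result = {w: wc_reduce(w, vals) for w, vals in groups.items()}
# -> {'cat': '2', 'dog': '4', 'bat': '1'}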

13 Google MR Overview

14 Implementation Issues
- Combine function
- File system
- Partition of input, keys
- Failures
- Backup tasks
- Ordering of results

15 Combine Function
Combine is like a local reduce applied before distribution: a map worker that has emitted [cat 1], [cat 1], [cat 1], ... sends [cat 3], ... instead; a worker that has emitted [dog 1], [dog 1], ... sends [dog 2], ...
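A minimal Python sketch of the idea (my own illustration, not Google's API): the combiner has the same shape as the reducer and is applied to each map worker's local output before anything is shipped.
from collections import defaultdict
def combine(local_pairs):
    # local_pairs is one map worker's output, e.g. [("cat", 1), ("cat", 1), ("cat", 1), ("dog", 1)]
    partial = defaultdict(int)
    for word, count in local_pairs:
        partial[word] += count
    return sorted(partial.items())          # e.g. [("cat", 3), ("dog", 1)]: far fewer pairs to ship
# Combining is safe here because the reduce function (a sum) is associative and commutative.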

16 Distributed File System
- reduce workers must be able to access the local disks of map workers
- all data transfers go through the distributed file system
- any worker must be able to write its part of the answer; the answer is left as a distributed file
- any worker must be able to access any part of the input file

17 Partition of input, keys
How many workers? How many splits (partitions) of the input file?
Best to have many splits per worker: this improves load balance, and if a worker fails it is easier to spread its tasks over the remaining workers.
Should workers be assigned to splits "near" them?
Similar questions arise for reduce workers.
[Figure: workers assigned to numbered input splits 1, 2, 3, ..., 9.]

18 Failures
The distributed implementation should produce the same output as would have been produced by a non-faulty sequential execution of the program.
General strategy: the master detects worker failures (by periodically checking "ok?") and has the work, e.g. split j, re-done by another worker.

19 Backup Tasks
A straggler is a machine that takes unusually long to finish its work (e.g., because of a bad disk). A straggler can delay final completion.
When the computation is close to finishing, the master schedules backup executions of the remaining tasks. It must be able to eliminate redundant results.

20 Ordering of Results Final result (at each node) is in key order
The intermediate pairs fed to each reducer, e.g. [k1, v1] [k3, v3], are also in key order, as is the final result, e.g. [k1, T1] [k2, T2] [k3, T3] [k4, T4].

21 Example: Sorting Records
Map: extract the key k, output [k, record]. Reduce: do nothing!
[Figure: map workers W1, W2, W3 route records by key range to reduce workers W5, W6. Question: one or two records for k=6?]
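A rough Python sketch of this trick (the key field, reducer count, and range partitioner are my own illustrative choices): map tags each record with its key, reduce passes records through unchanged, and a range partitioner makes the concatenation of the reducers' key-ordered outputs globally sorted.
def sort_map(record):
    return [(record["k"], record)]           # extract the key, emit [k, record]
def sort_reduce(k, records):
    return records                           # do nothing: the framework already sorted by key
def range_partition(k, num_reducers=2, max_key=9):
    # contiguous key ranges per reducer, so reducer outputs concatenate into a sorted file
    return min(k * num_reducers // (max_key + 1), num_reducers - 1)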

22 Other Issues
- Skipping bad records
- Debugging

23 MR Claimed Advantages
- Model easy to use; hides details of parallelization and fault recovery
- Many problems expressible in the MR framework
- Scales to thousands of machines

24 MR Possible Disadvantages
- 1-input, 2-stage data flow is rigid, hard to adapt to other scenarios
- Custom code needs to be written even for the most common operations, e.g., projection and filtering
- Opaque nature of the map and reduce functions impedes optimization

25 Questions
- Can MR be made more "declarative"?
- How can we perform joins?
- How can we perform approximate grouping? Example: for all keys that are similar, reduce all the values for those keys.

26 Additional Topics
- Hadoop: open-source Map-Reduce system
- Pig: Yahoo system that builds on MR but is more declarative

27 Pig & Pig Latin
A layer on top of map-reduce (Hadoop): Pig is the system, Pig Latin is the query language.
Pig Latin is a hybrid between a high-level declarative query language in the spirit of SQL and low-level, procedural programming à la map-reduce.

28 Example Table urls: (url, category, pagerank)
Find, for each sufficiently large category, the average pagerank of high-pagerank urls in that category. In SQL:
SELECT category, AVG(pagerank)
FROM urls
WHERE pagerank > 0.2
GROUP BY category
HAVING COUNT(*) > 10^6

29 Example in Pig Latin
The same query in Pig Latin:
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
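To make the dataflow concrete, here is roughly what those four statements compute, written as plain Python over a small list of (url, category, pagerank) tuples (toy data and a toy threshold of 1 in place of 10^6 so the example produces output; all names are illustrative):
from collections import defaultdict
urls = [("a.com", "news", 0.9), ("b.com", "news", 0.4),
        ("c.com", "blogs", 0.1), ("d.com", "blogs", 0.5)]
good_urls = [u for u in urls if u[2] > 0.2]                    # FILTER BY pagerank > 0.2
groups = defaultdict(list)                                     # GROUP BY category
for u in good_urls:
    groups[u[1]].append(u)
big_groups = {c: g for c, g in groups.items() if len(g) > 1}   # FILTER BY COUNT(...) > 1
output = {c: sum(u[2] for u in g) / len(g)                     # GENERATE category, AVG(pagerank)
          for c, g in big_groups.items()}
# -> {'news': 0.65} (within float rounding)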

30 good_urls = FILTER urls BY pagerank > 0.2;
Input urls: (url, category, pagerank); output good_urls: (url, category, pagerank), the same schema with only the qualifying tuples.

31 groups = GROUP good_urls BY category;
Input good_urls: (url, category, pagerank); output groups: (category, good_urls), where good_urls is the bag of good_urls tuples in that category.

32 big_groups = FILTER groups BY COUNT(good_urls)>1;
Input groups: (category, good_urls); output big_groups: (category, good_urls), keeping only the groups whose good_urls bag has more than one tuple.

33 output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
Input big_groups: (category, good_urls); output: (category, average pagerank of the group's good urls).

34 Features
- Similar to specifying a query execution plan (i.e., a dataflow graph), thereby making it easier for programmers to understand and control how their data processing task is executed
- Support for a flexible, fully nested data model
- Extensive support for user-defined functions
- Ability to operate over plain input files without any schema information
- Novel debugging environment, useful when dealing with enormous data sets

35 Execution Control: Good or Bad?
Example:
spam_urls = FILTER urls BY isSpam(url);
culprit_urls = FILTER spam_urls BY pagerank > 0.8;
Should the system re-order the filters (e.g., apply the cheap pagerank test before the expensive isSpam UDF)?

36 User Defined Functions
Example:
groups = GROUP urls BY category;
output = FOREACH groups GENERATE category, top10(urls);   (should it be groups.url?)
[Figure: input groups such as (.gov, {(x.fbi.gov, .gov, 0.7) ...}), (.edu, {(y.yale.edu, .edu, 0.5) ...}), (.com, {(z.cnn.com, .com, 0.9) ...}) become (.gov, {(fbi.gov) (cia.gov) ...}), (.edu, {(yale.edu) ...}), (.com, {(cnn.com) (ibm.com) ...}).]
The UDF top10 can return a scalar or a set.

37 Data Model
- Atom, e.g., `alice'
- Tuple, e.g., (`alice', `lakers')
- Bag, e.g., { (`alice', `lakers') (`alice', (`iPod', `apple')) }
- Map, e.g., [ `fan of' → { (`lakers') (`iPod') }, `age' → 20 ]
Note: bags can currently only hold tuples, so {1, 2, 3} is stored as {(1) (2) (3)}.
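One way to picture the nesting in ordinary Python terms (my own analogy, not Pig's implementation): atoms are strings or numbers, tuples are tuples, bags are lists of tuples, and maps are dicts.
atom = "alice"                                   # Atom
tup = ("alice", "lakers")                        # Tuple; fields may themselves be nested
bag = [("alice", "lakers"),                      # Bag: a collection of tuples, duplicates allowed
       ("alice", ("iPod", "apple"))]
fan_map = {"fan of": [("lakers",), ("iPod",)],   # Map: keys are atoms, values can be any type
           "age": 20}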

38 Expressions in Pig Latin
[Table of Pig Latin expression types (from the paper), not recoverable here. Notes on the slide: "Should be (1) + (2)"; see the flatten examples ahead.]

39 Specifying Input Data
queries = LOAD `query_log.txt' USING myLoad() AS (userId, queryString, timestamp);
Here `queries' is a handle for future use, query_log.txt is the input file, myLoad() is a custom deserializer, and (userId, queryString, timestamp) is the output schema.

40 For Each
expanded_queries = FOREACH queries GENERATE userId, expandQuery(queryString);
See the example on the next slide. Note that each tuple is processed independently, which is good for parallelism.
To remove one level of nesting:
expanded_queries = FOREACH queries GENERATE userId, FLATTEN(expandQuery(queryString));
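A rough Python picture of the two variants (expandQuery and the sample data are made up for illustration):
def expand_query(q):
    # toy stand-in for the expandQuery UDF: returns a bag (list) of 1-field tuples
    return [(q,), (q + " news",)]
queries = [("u1", "lakers"), ("u2", "iPod")]
# FOREACH queries GENERATE userId, expandQuery(queryString): the bag stays nested
nested = [(uid, expand_query(q)) for uid, q in queries]
# FOREACH queries GENERATE userId, FLATTEN(expandQuery(queryString)):
# one output tuple per element of the bag, with the bag element's fields spliced in
flattened = [(uid, *t) for uid, q in queries for t in expand_query(q)]
# -> [('u1', 'lakers'), ('u1', 'lakers news'), ('u2', 'iPod'), ('u2', 'iPod news')]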

41 ForEach and Flattening
[Figure illustrating FOREACH with FLATTEN on the expanded queries; note that "lakers rumors" is a single string value, output together with the userId.]

42 Flattening Example (Fill In)
Given a table X with columns A, B, C (contents shown in the figure), fill in:
Y = FOREACH X GENERATE A, FLATTEN(B), C

43 Flattening Example (Fill In)
Y = FOREACH X GENERATE A, FLATTEN(B), C
Z = FOREACH Y GENERATE A, B, FLATTEN(C)
Is Z = Z' where Z' = FOREACH X GENERATE A, FLATTEN(B), FLATTEN(C)?

44 Flattening Example
[Table X with columns A, B, C and the result Y shown in the figure.]
Y = FOREACH X GENERATE A, FLATTEN(B), C
Note that the first tuple of Y is (a1, b1, b2, {(c1)(c2)}): flatten is not recursive.
Note also that attribute naming gets complicated; for example, $2 of the first tuple is b2, while for the third tuple it is {(c1)(c2)}.

45 Flattening Example Y = FOREACH X GENERATE A, FLATTEN(B), C
Z = FOREACH Y GENERATE A, B, FLATTEN(C)
Note that Z = Z' where Z' = FOREACH X GENERATE A, FLATTEN(B), FLATTEN(C).

46 Filter real_queries = FILTER queries BY userId neq `bot';
real_queries = FILTER queries BY NOT isBot(userId);   (isBot is a UDF)

47 Co-Group Two data sets for example:
results: (queryString, url, position)
revenue: (queryString, adSlot, amount)
grouped_data = COGROUP results BY queryString, revenue BY queryString;
url_revenues = FOREACH grouped_data GENERATE FLATTEN(distributeRevenue(results, revenue));
Co-Group is more flexible than an SQL JOIN.
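A small Python sketch of what COGROUP itself produces (toy tuples; distributeRevenue is omitted since it is a UDF the slide does not define):
results = [("q1", "a.com", 1), ("q1", "b.com", 2), ("q2", "c.com", 1)]
revenue = [("q1", "top", 10.0), ("q2", "side", 2.0)]
def cogroup(xs, ys, key=lambda t: t[0]):
    # one output tuple per key: (key, bag of xs with that key, bag of ys with that key)
    keys = sorted({key(t) for t in xs} | {key(t) for t in ys})
    return [(k,
             [t for t in xs if key(t) == k],
             [t for t in ys if key(t) == k]) for k in keys]
grouped_data = cogroup(results, revenue)
# [('q1', [('q1', 'a.com', 1), ('q1', 'b.com', 2)], [('q1', 'top', 10.0)]),
#  ('q2', [('q2', 'c.com', 1)], [('q2', 'side', 2.0)])]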

48 CoGroup vs Join

49 Group (Simple CoGroup)
grouped_revenue = GROUP revenue BY queryString;
query_revenues = FOREACH grouped_revenue GENERATE queryString, SUM(revenue.amount) AS totalRevenue;

50 CoGroup Example 1
Tables X(A, B, C) and Y(A, B, D).
Z1 = GROUP X BY A
[Result Z1, shown in the figure, has a column A and a bag column X holding the X tuples with that A value.]

51 CoGroup Example 1
[Same example, with the result Z1 (columns A and X) filled in on the slide.]

52 CoGroup Example 2
Tables X(A, B, C) and Y(A, B, D).
Z2 = GROUP X BY (A, B)    (grouping by a tuple of attributes: syntax not in the paper but being added)
[Result shown in the figure: a group column and a bag column X. What does the group column hold?]

53 CoGroup Example 2
[Same example, filled in on the slide: the group column (labelled A/B?) holds the (A, B) pair, and the bag column X holds the matching X tuples.]

54 CoGroup Example 3
Tables X(A, B, C) and Y(A, B, D).
Z3 = COGROUP X BY A, Y BY A
[Result shown in the figure: one tuple per A value, with columns A, X (bag of matching X tuples), and Y (bag of matching Y tuples).]

55 CoGroup Example 3
[Same example, with the result Z3 (columns A, X, Y) filled in on the slide.]

56 CoGroup Example 4
Tables X(A, B, C) and Y(A, B, D).
Z4 = COGROUP X BY A, Y BY B
[Result shown in the figure: columns A, X, Y, where the X bag matches on X.A and the Y bag matches on Y.B.]

57 CoGroup Example 4
[Same example, with the result Z4 filled in on the slide.]

58 CoGroup With Function Call?
Table X(A, B).
Y = GROUP X BY A
Z = GROUP X BY SUM(A)    (SUM adds up the integers in the tuple's A field)
[Results shown in the figure: Y has columns A and X; what is the grouping column of Z?]

59 CoGroup With Function Call?
Table X(A, B).
Y = GROUP X BY A
Z = GROUP X BY SUM(A)
[Filled in on the slide: Y has columns A and X; Z has a grouping column (labelled SUM(A)/A?) and a bag column X.]

60 Pig Latin Join
join_result = JOIN results BY queryString, revenue BY queryString;
Shorthand for:
temp_var = COGROUP results BY queryString, revenue BY queryString;
join_result = FOREACH temp_var GENERATE FLATTEN(results), FLATTEN(revenue);
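Continuing the Python analogy from the COGROUP sketch (slide 47): flattening both bags of each cogrouped tuple yields the per-key cross product, i.e., the equi-join.
def join_via_cogroup(xs, ys, key=lambda t: t[0]):
    joined = []
    for k in sorted({key(t) for t in xs} & {key(t) for t in ys}):
        for x in (t for t in xs if key(t) == k):       # FLATTEN(results)
            for y in (t for t in ys if key(t) == k):   # FLATTEN(revenue)
                joined.append(x + y)                   # concatenate the two tuples
    return joined
results = [("q1", "a.com", 1), ("q1", "b.com", 2)]
revenue = [("q1", "top", 10.0)]
# join_via_cogroup(results, revenue)
# -> [('q1', 'a.com', 1, 'q1', 'top', 10.0), ('q1', 'b.com', 2, 'q1', 'top', 10.0)]
Keys that appear on only one side drop out, because flattening an empty bag produces no tuples.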

61 MapReduce in Pig Latin
map_result = FOREACH input GENERATE FLATTEN(map(*));
key_groups = GROUP map_result BY $0;     ($0 is the first attribute, i.e., the key)
output = FOREACH key_groups GENERATE reduce(*);     (* means all attributes)

62 Store To materialize result in a file:
STORE query_revenues INTO `myoutput' USING myStore();
Here `myoutput' is the output file and myStore() is a custom serializer.

63 Hadoop
- HDFS: Hadoop file system
- How to use Hadoop, examples

