1 MAP-REDUCE: WIN -OR- EPIC WIN CSC313: Advanced Programming Topics

2 Brief History of Google
BackRub: disk drives, 24 GB total storage

4 Brief History of Google
Google: disk drives, 366 GB total storage

6 Traditional Design Principles
- If the problem is big enough, put a supercomputer to work on it
- Supercomputers use desktop CPUs, just a lot more of them
- A supercomputer also provides huge bandwidth to memory, equivalent to many machines' bandwidth at once
- But supercomputers are VERY, VERY expensive, and maintenance is also expensive once the machine is bought
- You do get something for the money: high quality == low downtime
- The safe, expensive solution to very large problems

7 Why Trade Money for Safety?

9 How Was Search Performed? (diagram: DNS)

14 Google’s Big Insight
- Performing search is “embarrassingly parallel”
  - No need for a supercomputer and all that expense
  - Can instead do this using lots & lots of desktops
  - Identical effective bandwidth & performance
- But the problem is that desktop machines are unreliable
  - Budget for 2 replacements, since machines are cheap
  - Just expect failure; software provides the quality

18 Brief History of Google
Google: 2012, ?0,000 total servers, ??? PB total storage

19 How Is Search Performed Now? (diagram: Spell Checker, Ad Server, Document Servers (TB), Index Servers (TB))

24 Google’s Processing Model
- Buy cheap machines & prepare for the worst
  - Machines are going to fail, but this is still the cheaper approach
- Important steps keep the whole system reliable
  - Replicate data so that information losses are limited (a sketch follows this list)
  - Move data freely so loads can always be rebalanced
- These decisions lead to many other benefits
  - Scalability is helped by the focus on balancing
  - Search speed improved; performance much better
  - Resources are fully utilized, since search demand varies
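
A minimal sketch of the replication idea in Java; the replication factor of 3 and the round-robin placement are illustrative assumptions, not Google's actual policy:

import java.util.*;

public class ReplicaPlacement {
    static final int REPLICAS = 3; // assumed replication factor

    // Place each chunk on REPLICAS distinct machines so losing any one
    // machine never destroys the only copy of a chunk.
    static Map<Integer, List<Integer>> place(int chunks, int machines) {
        Map<Integer, List<Integer>> placement = new HashMap<>();
        for (int c = 0; c < chunks; c++) {
            List<Integer> copies = new ArrayList<>();
            for (int r = 0; r < REPLICAS; r++)
                copies.add((c + r) % machines); // simple round-robin spread
            placement.put(c, copies);
        }
        return placement;
    }

    public static void main(String[] args) {
        // 5 chunks across 4 machines; each chunk lives on 3 of them
        place(5, 4).forEach((chunk, hosts) ->
                System.out.println("chunk " + chunk + " -> machines " + hosts));
    }
}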

25 Heterogeneous Processing
- By buying the cheapest computers, variance is high
- Programs must handle both homogeneous & heterogeneous systems
- A centralized workqueue helps with the different speeds (see the sketch after this list)
- This process also leads to a few small downsides
  - Space
  - Power consumption
  - Cooling costs
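
A minimal sketch of the centralized-workqueue idea in Java; the worker count, task count, and random sleeps are made up to simulate machines of different speeds. Because every worker pulls from one shared queue, faster machines simply take on more tasks:

import java.util.concurrent.*;

public class WorkQueue {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Integer> tasks = new LinkedBlockingQueue<>();
        for (int t = 0; t < 20; t++) tasks.add(t);

        Runnable worker = () -> {
            try {
                Integer task;
                // Pull tasks until the queue drains; fast workers loop more often
                while ((task = tasks.poll(100, TimeUnit.MILLISECONDS)) != null) {
                    // Simulated heterogeneous speed: sleep a random amount per task
                    Thread.sleep(ThreadLocalRandom.current().nextInt(10, 50));
                    System.out.println(Thread.currentThread().getName()
                            + " finished task " + task);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int w = 0; w < 4; w++) pool.submit(worker);
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}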

27 Complexity at Google

29 Google Abstractions
- Google File System: handles replication to provide scalability & durability
- BigTable: manages large relational data sets
- Chubby: gonna skip past that joke; a distributed locking service
- MapReduce: if the job fits, easy parallelism is possible without much work

31 Remember Google’s Problem

32 MapReduce Overview
- Programming model provides a good Façade
  - Automatic parallelization & load balancing
  - Network and disk I/O optimization
  - Robust performance even if machines fail
- The idea came from 2 Lisp (functional) primitives (see the sketch after this list)
  - Map: process each entry in a list using some function
  - Reduce: recombine the data using a given function
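
A minimal sketch of those two primitives using Java streams (Java rather than Lisp, for consistency with the pseudocode on the later slides):

import java.util.List;

public class MapReducePrimitives {
    public static void main(String[] args) {
        List<String> words = List.of("to", "be", "or", "not", "to", "be");
        // Map: process each entry in the list using some function
        List<Integer> lengths = words.stream().map(String::length).toList();
        // Reduce: recombine the data using a given (binary) function
        int totalChars = lengths.stream().reduce(0, Integer::sum);
        System.out.println(totalChars); // 13
    }
}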

36 Typical MapReduce Problem
1. Read lots and lots of data (e.g., TBs)
2. Map: extract the important data from each entry in the input
3. Combine the Maps and sort the entries by key
4. Reduce: process each key’s entries to get the result for that key
5. Output the final result & watch the money roll in

40 Pictorial View of MapReduce

41 Ex: Count Word Frequencies
- Map processes the files separately & counts the word frequencies in each
- (diagram: Map takes each Key=URL, Value=text on page pair and emits many Key'=word, Value'=count pairs)

43 Ex: Count Word Frequencies
- In the shuffle step, the Maps are combined & the entries sorted by key
- Reduce combines each key’s results to compute the final output
- (diagram: the pairs (to,1), (be,1), (or,1), (not,1), (to,1), (be,1) become (to,2), (be,2), (or,1), (not,1))

45 Word Frequency Pseudo-code

Map(String input_key, String input_values) {
    String[] words = input_values.split(" ");
    foreach w in words {
        EmitIntermediate(w, "1");
    }
}

Reduce(String key, Iterator intermediate_values) {
    int result = 0;
    foreach v in intermediate_values {
        result += ParseInt(v);
    }
    Emit(result);
}
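
A runnable single-process Java sketch of the same job (an illustration, not Google's implementation); the shuffle step is simulated with groupingBy, and the page contents are made-up sample data:

import java.util.*;
import java.util.stream.Collectors;

public class WordCount {
    public static void main(String[] args) {
        // Made-up input: Key=URL, Value=text on the page
        Map<String, String> pages = Map.of(
                "url1", "to be or not",
                "url2", "to be");

        // Map phase: emit an intermediate (word, 1) pair for every word
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String text : pages.values())
            for (String w : text.split(" "))
                intermediate.add(Map.entry(w, 1));

        // Shuffle: combine the Maps' output & group the entries by key
        Map<String, List<Integer>> byWord = intermediate.stream()
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        // Reduce phase: sum each key's values to get the final count
        byWord.forEach((word, ones) -> System.out.println(
                word + " = " + ones.stream().mapToInt(Integer::intValue).sum()));
    }
}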

46 Ex: Build Search Index
- Map processes the files separately & records the words found on each
- (diagram: Map takes each Key=URL, Value=text on page pair and emits Key'=word, Value'=URL pairs)
- To get the search Map, Reduce combines each key’s results
- (diagram: Reduce collects the (word, URL) pairs into Key=word, Value=URLs with that word)

48 Search Index Pseudo-code

Map(String input_key, String input_values) {
    String[] words = input_values.split(" ");
    foreach w in words {
        EmitIntermediate(w, input_key);
    }
}

Reduce(String key, Iterator intermediate_values) {
    List result = new ArrayList();
    foreach v in intermediate_values {
        result.add(v);
    }
    Emit(result);
}
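
The single-process sketch adapts directly; only the emitted pairs change, mapping each word to the URLs it appears on (again with made-up page data):

import java.util.*;
import java.util.stream.Collectors;

public class SearchIndex {
    public static void main(String[] args) {
        // Made-up input: Key=URL, Value=text on the page
        Map<String, String> pages = Map.of(
                "url1", "to be or not",
                "url2", "to be");

        // Map phase: emit an intermediate (word, URL) pair for every word
        List<Map.Entry<String, String>> intermediate = new ArrayList<>();
        pages.forEach((url, text) -> {
            for (String w : text.split(" "))
                intermediate.add(Map.entry(w, url));
        });

        // Shuffle + Reduce: collect each word's URLs into one list
        Map<String, List<String>> index = intermediate.stream()
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        index.forEach((word, urls) -> System.out.println(word + " -> " + urls));
    }
}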

49 Ex: Page Rank Computation
- Google’s algorithm for ranking pages’ relevance

50 Ex: Page Rank Computation
- (diagram: Map takes each Key=URL, Value=links on page pair and emits a Key'=link on page, Value'=rank contribution pair for each link; Reduce sums (+) each URL's incoming contributions to compute its new rank)
- Repeat the entire process (e.g., feed the Reduce results back into Map) until the page ranks stabilize (the sum of changes to the ranks drops below some threshold); a sketch follows
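
A minimal sketch of one such iteration in Java, using the standard PageRank update; the damping factor of 0.85 and the three-page graph are assumptions, since the slides give neither:

import java.util.*;

public class PageRankStep {
    static final double DAMPING = 0.85; // assumed; not given on the slides

    // One MapReduce-style iteration over (Key=URL, Value=links on page)
    static Map<String, Double> iterate(Map<String, List<String>> links,
                                       Map<String, Double> ranks) {
        // Map phase: each page emits (link, share of its rank) per outgoing link
        Map<String, Double> contributions = new HashMap<>();
        links.forEach((url, outLinks) -> {
            double share = ranks.get(url) / outLinks.size();
            for (String target : outLinks)
                contributions.merge(target, share, Double::sum); // Reduce: sum (+) per key
        });
        // Damped update produces each page's new rank
        Map<String, Double> next = new HashMap<>();
        for (String url : links.keySet())
            next.put(url, (1 - DAMPING) + DAMPING * contributions.getOrDefault(url, 0.0));
        return next;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("c"),
                "c", List.of("a"));
        Map<String, Double> ranks = new HashMap<>(Map.of("a", 1.0, "b", 1.0, "c", 1.0));
        // Repeat until the sum of changes drops below some threshold, per the slide
        for (double delta = 1.0; delta > 1e-6; ) {
            Map<String, Double> next = iterate(links, ranks);
            delta = next.entrySet().stream()
                    .mapToDouble(e -> Math.abs(e.getValue() - ranks.get(e.getKey())))
                    .sum();
            ranks.clear();
            ranks.putAll(next);
        }
        System.out.println(ranks);
    }
}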

53 Advanced MapReduce Ideas
- How to implement? One master, many workers (a sketch follows this list)
  - Split the input data into tasks, where each task's size is fixed
  - The reduce phase will also be partitioned into tasks
- Dynamically assign the tasks to workers during each step
  - Tasks are assigned as needed & placed in an in-process list
  - Once a worker completes a task, save the result & retire the task
- Assume a worker crashed if its task is not complete in time
  - Move incomplete tasks back into the pool for reassignment
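
A minimal sketch of the master's bookkeeping in Java; the 50 ms timeout and the coin-flip completions are stand-ins for real worker RPCs and crash detection:

import java.util.*;

public class Master {
    // A task currently assigned to some worker, with a completion deadline
    record InProcess(int task, long deadlineMillis) {}

    public static void main(String[] args) throws InterruptedException {
        Deque<Integer> pool = new ArrayDeque<>(List.of(0, 1, 2, 3, 4));
        List<InProcess> inProcess = new ArrayList<>();
        Set<Integer> done = new HashSet<>();
        Random rng = new Random();
        final long TIMEOUT = 50; // assumed crash-detection timeout (ms)

        while (done.size() < 5) {
            long now = System.currentTimeMillis();
            // Assume the worker crashed if a task is not complete in time:
            // move it back into the pool for reassignment
            for (Iterator<InProcess> it = inProcess.iterator(); it.hasNext(); ) {
                InProcess p = it.next();
                if (now > p.deadlineMillis()) {
                    pool.addLast(p.task());
                    it.remove();
                }
            }
            // Assign pooled tasks & place them in the in-process list
            while (!pool.isEmpty())
                inProcess.add(new InProcess(pool.removeFirst(), now + TIMEOUT));
            // Simulated completions: each assigned task finishes with 50% odds
            for (Iterator<InProcess> it = inProcess.iterator(); it.hasNext(); ) {
                InProcess p = it.next();
                if (rng.nextBoolean()) { // stand-in for a completion report
                    done.add(p.task()); // save the result & retire the task
                    it.remove();
                }
            }
            Thread.sleep(10);
        }
        System.out.println("all tasks complete: " + done);
    }
}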


