How Google would do GREP 684.02 Spring 2006. Google Massive datasets Massive numbers of machines, working in parallel.


1 How Google would do GREP 684.02 Spring 2006

2 Google
Massive datasets
Massive numbers of machines, working in parallel

3 Requirements
Need a programming model that:
– Parallelizes easily
– Allows Ph.D.-level engineers/scientists to specify and execute NLP-like tasks on the big clusters
– Does not require serious expertise in parallel programming

4 Map/Reduce
Insight 1: Much of the input/output is generic, so specify only the transformation required.
Insight 2: The part of the process that says “do something to every item” is really easy to parallelize.
Insight 3: Do something to every item and then collect the results.

5 Map
Output is one per input line:
A -> []
B -> []
C -> [(C,1)]
D -> []
C -> [(C,1)]
Each output is a possibly empty list of key/value pairs.
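A minimal Python sketch of the map step on this slide, assuming the task is a grep for lines containing the pattern "C". The function name `grep_map` and the sample input lines are illustrative assumptions, not part of the original slides.

```python
def grep_map(line, pattern="C"):
    """Map one input line to a possibly empty list of (key, value) pairs."""
    # Emit (pattern, 1) when the line matches; otherwise emit nothing.
    return [(pattern, 1)] if pattern in line else []


# Hypothetical input: one item per line, mirroring the slide.
lines = ["A", "B", "C", "D", "C"]
map_outputs = [grep_map(line) for line in lines]

for line, out in zip(lines, map_outputs):
    print(line, "->", out)
```

Because `grep_map` looks at one line at a time and keeps no state, the calls could run on different machines in any order.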

6 Reduce
The map/reduce implementation gathers together all pairs with the same key, so reduce sees each key paired with a list of its values: [..., (C,[1,1]), ...]
For this task, reduce just takes the length of the list of values.
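The gathering and reduce steps might be sketched in Python as follows. The names `shuffle` and `count_reduce` are assumptions made for illustration; in a real deployment the grouping is done by the framework across machines, not by application code.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group values by key, standing in for what the framework does."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return list(grouped.items())


def count_reduce(key, values):
    """Reduce for the grep count: the match count is the list length."""
    return key, len(values)


# Pairs emitted by the map step in the slide's example.
pairs = [("C", 1), ("C", 1)]
grouped = shuffle(pairs)                      # what reduce sees
results = [count_reduce(k, vs) for k, vs in grouped]
print(grouped, results)
```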

7 Reflections
This is a lot like awk, which said, “Tell me what you do to each line; I’ll handle the details of delivering them to you.”
Behind the scenes, it is sensible for the implementation to be clever about how it pulls pairs from a large cluster of machines, but this is not the application programmer’s problem.
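To illustrate the division of labor described here, the following is a toy single-machine driver in Python. It is a sketch under stated assumptions: `run_mapreduce`, `match_c`, and `count` are hypothetical names, and a thread pool stands in for the cluster. The application programmer writes only the two small functions; the driver handles delivery of items and collection of pairs.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def run_mapreduce(map_fn, reduce_fn, items, workers=4):
    """Toy driver: the 'implementation' side that the application never sees."""
    # Map step: items are independent, so a pool can process them in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_fn, items))
    # Gather step: collect all values that share a key.
    grouped = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            grouped[key].append(value)
    # Reduce step: one call per key.
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


# Application side: only the per-item transformation and the per-key summary.
def match_c(line):
    return [("C", 1)] if "C" in line else []


def count(key, values):
    return len(values)


result = run_mapreduce(match_c, count, ["A", "B", "C", "D", "C"])
print(result)
```

Swapping the thread pool for a distributed scheduler would change `run_mapreduce` but leave `match_c` and `count` untouched, which is the point of the slide.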

8 Google’s (and Microsoft’s) papers
http://labs.google.com/papers/mapreduce.html
http://labs.google.com/papers/sawzall-sciprog.pdf
http://www.cs.vu.nl/~ralf/MapReduce/paper.pdf

