Google
–Massive datasets
–Massive numbers of machines, working in parallel
Requirements
Need a programming model that
–Parallelizes easily
–Allows Ph.D.-level engineers/scientists to specify and execute NLP-like tasks on the big clusters
–Does not require serious expertise in parallel programming
Map/Reduce
–Insight 1: much of the input/output handling is generic, so specify only the transformation required.
–Insight 2: the part of the process that says “do something to every item” is really easy to parallelize.
–Insight 3: do something to every item, then collect the results.
Map
–One output per input line (here, counting occurrences of C):
  A -> []
  B -> []
  C -> [(C,1)]
  D -> []
  C -> [(C,1)]
–Output is a possibly empty list of key/value pairs
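The map step on this slide can be sketched in a few lines of Python. The function name `map_fn` and the list-comprehension driver are illustrative, not part of any real MapReduce framework's API:

```python
def map_fn(line):
    # Emit (C, 1) for each input line equal to "C"; otherwise emit
    # nothing -- the output is a possibly empty list of key/value pairs.
    if line == "C":
        return [("C", 1)]
    return []

# One output list per input line, matching the slide's example:
outputs = [map_fn(line) for line in ["A", "B", "C", "D", "C"]]
# outputs == [[], [], [("C", 1)], [], [("C", 1)]]
```

Note that the mapper knows nothing about parallelism: it is a pure function of one line, which is exactly what lets the framework run it on every machine at once.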
Reduce
–The map/reduce implementation gathers together all pairs with the same key, so reduce sees a key paired with a list of values: […, (C,[1,1]), …]
–Reduce just takes the length of the list of values
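The grouping-plus-reduce behavior described above can be sketched as follows; the `shuffle` helper is a hypothetical stand-in for the framework's own (distributed) grouping machinery:

```python
from collections import defaultdict

def reduce_fn(key, values):
    # The slide's reduce: the count is just the length of the value list.
    return (key, len(values))

def shuffle(pairs):
    # Stand-in for the framework's shuffle step: gather all pairs
    # with the same key into one (key, [value, ...]) entry.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped.items()

# All the pairs emitted by the mappers, e.g. (C,1) twice:
pairs = [("C", 1), ("C", 1)]
results = [reduce_fn(k, vs) for k, vs in shuffle(pairs)]
# results == [("C", 2)]
```

As with map, the reducer is a pure function of one key and its values, so different keys can be reduced on different machines without coordination.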
Reflections
–This is a lot like awk, which said, “tell me what to do to each line; I’ll handle the details of delivering the lines to you.”
–Behind the scenes, it is sensible for the implementation to be clever about how it pulls pairs from a large cluster of machines, but this is not the application programmer’s problem.