Google
–Massive datasets
–Massive numbers of machines, working in parallel
Requirements
Need a programming model that
–Parallelizes easily
–Allows Ph.D.-level engineers/scientists to specify and execute NLP-like tasks on the big clusters
–Does not require serious expertise in parallel programming
Map/Reduce
–Insight 1: much of the input/output handling is generic, so specify only the transformation required.
–Insight 2: the part of the process that says “do something to every item” is really easy to parallelize.
–Insight 3: do something to every item, then collect the results.
Map
–One output per input line (here, counting occurrences of C):
  A -> []
  B -> []
  C -> [(C,1)]
  D -> []
  C -> [(C,1)]
–Output is a possibly empty list of key/value pairs
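The map step on this slide can be sketched in a few lines of Python. The function name `map_fn` and the list-comprehension driver are illustrative, not part of any real MapReduce framework's API:

```python
def map_fn(line):
    # Emit (C, 1) for each input line equal to "C"; otherwise emit
    # nothing -- the output is a possibly empty list of key/value pairs.
    if line == "C":
        return [("C", 1)]
    return []

# One output list per input line, matching the slide's example:
outputs = [map_fn(line) for line in ["A", "B", "C", "D", "C"]]
# outputs == [[], [], [("C", 1)], [], [("C", 1)]]
```

Note that the mapper knows nothing about parallelism: it is a pure function of one line, which is exactly what lets the framework run it on every machine at once.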
Reduce
–The map/reduce implementation gathers together all pairs with the same key, so reduce sees a key paired with a list of values: […, (C,[1,1]), …]
–Reduce just takes the length of the list of values
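The grouping-plus-reduce behavior described above can be sketched as follows; the `shuffle` helper is a hypothetical stand-in for the framework's own (distributed) grouping machinery:

```python
from collections import defaultdict

def reduce_fn(key, values):
    # The slide's reduce: the count is just the length of the value list.
    return (key, len(values))

def shuffle(pairs):
    # Stand-in for the framework's shuffle step: gather all pairs
    # with the same key into one (key, [value, ...]) entry.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped.items()

# All the pairs emitted by the mappers, e.g. (C,1) twice:
pairs = [("C", 1), ("C", 1)]
results = [reduce_fn(k, vs) for k, vs in shuffle(pairs)]
# results == [("C", 2)]
```

As with map, the reducer is a pure function of one key and its values, so different keys can be reduced on different machines without coordination.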
Reflections
–This is a lot like awk, which said, “tell me what to do to each line; I’ll handle the details of delivering the lines to you.”
–Behind the scenes, it is sensible for the implementation to be clever about how it pulls pairs from a large cluster of machines, but this is not the application programmer’s problem.