Google's Map Reduce
Commodity Clusters Web data sets can be very large – Tens to hundreds of terabytes Cannot mine on a single server Standard architecture emerging: – Cluster of commodity Linux nodes – Gigabit Ethernet interconnect How to organize computations on this architecture?
Cluster Architecture [Diagram: racks of nodes, each node with its own CPU, memory, and disk, connected through a per-rack switch.] Each rack contains 16-64 nodes. There is 1 Gbps bandwidth between any pair of nodes in a rack, and a 2-10 Gbps backbone between racks.
Map Reduce Map-reduce is a high-level programming system that allows database processes to be written simply. The user writes code for two functions, map and reduce. A master controller divides the input data into chunks, and assigns different processors to execute the map function on each chunk. Other processors, perhaps the same ones, are then assigned to perform the reduce function on pieces of the output from the map function.
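The control flow described above can be sketched as a small sequential driver. This is a minimal illustration, not Google's implementation: the function names (`map_reduce`, `map_fn`, `reduce_fn`) are mine, and the chunking, master controller, and processor assignment are collapsed into plain loops.

```python
from collections import defaultdict

def map_reduce(inputs, map_fn, reduce_fn):
    """Sequential sketch of the map-reduce control flow.

    inputs:    iterable of (key, value) pairs.
    map_fn:    takes one (key, value) pair, yields intermediate pairs.
    reduce_fn: takes an intermediate key and the list of all values
               grouped under that key, returns a combined result.
    """
    # Map phase: apply map_fn to every input pair and group the
    # intermediate pairs by key (the "intermediate result").
    intermediate = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)

    # Reduce phase: combine the value list for each intermediate key.
    return {k: reduce_fn(k, vs) for k, vs in intermediate.items()}
```

In the real system the map calls run in parallel on chunk-holding processors and the grouped lists are shipped to reduce processors; the data flow, however, is exactly this.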
Data Organization Data is assumed stored in files. – Typically, the files are very large compared with the files found in conventional systems. For example, one file might be all the tuples of a very large relation. Or, the file might be a terabyte of "market baskets." Or, the file might be the "transition matrix of the Web," which is a representation of the Web graph with all Web pages as nodes and hyperlinks as edges. Files are divided into chunks, which might be complete cylinders of a disk, and are typically many megabytes.
The Map Function Input is thought of as a set of key-value records. Executed by one or more processes, located at any number of processors. – Each map process is given a chunk of the entire input data on which to work. Designed to take one key-value pair as input and to produce a list of key-value pairs as output. – The types of keys and values for the output of the map function need not be the same as the types of input keys and values. – The "keys" that are output from the map function are not true keys in the database sense. That is, there can be many pairs with the same key value. The result of executing all the map processes is a collection of key-value pairs called the intermediate result. – Each pair appears at the processor that generated it.
Map Example Constructing an Inverted Index Input is a collection of documents. The final output (not the output of map) is, for each word, a list of the documents that contain that word at least once. Map Function Input is a set of (i, d) pairs – i is a document ID – d is the corresponding document. The map function scans d and, for each word w it finds, emits the pair (w, i). – Notice that in the output, the word is the key and the document ID is the associated value. Output of map is a list of word-ID pairs. – It is not necessary to catch duplicate words in the document; duplicate elimination can be done later, in the reduce phase. – The intermediate result is the collection of all word-ID pairs created from all the documents in the input database.
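The map function for the inverted index is only a few lines. A minimal Python sketch (the function name and the whitespace-based word splitting are my assumptions; a real tokenizer would handle punctuation and case):

```python
def map_inverted_index(doc_id, text):
    """Map function for the inverted index:
    emit one (word, doc_id) pair per word occurrence in the document.
    Duplicates are NOT removed here; the reduce phase handles that."""
    for word in text.split():
        yield (word, doc_id)
```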
Note. The output of a map-reduce algorithm is always a set of key-value pairs. This makes it possible, in some applications, to compose two or more map-reduce operations.
The Reduce Function The second user-defined function, reduce, is also executed by one or more processes, located at any number of processors. Input to reduce is a single key value from the intermediate result, together with the list of all values that appear with this key in the intermediate result. The reduce function itself combines the list of values associated with a given key k. The result is k paired with a value of some type.
Reduce Example Constructing an Inverted Index Input is a collection of documents. The final output is, for each word, a list of the documents that contain that word at least once. Reduce Function The intermediate result consists of pairs of the form (w, [i1, i2, …, in]) – where the i's are a list of document IDs, one for each occurrence of word w. The reduce function takes the list of IDs, eliminates duplicates, and sorts the list of unique IDs.
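The matching reduce function is equally short. A sketch under the same assumptions as the map example (the name is mine; `set` handles duplicate elimination and `sorted` the ordering):

```python
def reduce_inverted_index(word, doc_ids):
    """Reduce function for the inverted index:
    eliminate duplicate document IDs and sort the unique IDs."""
    return (word, sorted(set(doc_ids)))
```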
Parallelism This organization of the computation makes excellent use of whatever parallelism is available. The map function works on a single document, so we could have as many processes and processors as there are documents in the database. The reduce function works on a single word, so we could have as many processes and processors as there are words in the database. Of course, it is unlikely that we would use so many processors in practice.
Another Example – Word Count Construct a word count. For each word w that appears at least once in our database of documents, output the pair (w, c), where c is the number of times w appears among all the documents. The map function Input is a document. It goes through the document and, each time it encounters a word w, emits the pair (w, 1). The intermediate result is a list of pairs (w1, 1), (w2, 1), …. The reduce function Input is a pair (w, [1, 1, …, 1]), with one 1 for each occurrence of word w. It sums the 1's, producing the count c. Output is the word-count pair (w, c).
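Both halves of word count fit in a few lines each. A minimal sketch (names and whitespace tokenization are my assumptions):

```python
def map_word_count(doc_id, text):
    """Map function: emit (word, 1) for every word occurrence."""
    for word in text.split():
        yield (word, 1)

def reduce_word_count(word, ones):
    """Reduce function: sum the 1's to get the total count for the word."""
    return (word, sum(ones))
```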
What about Joins? R(A, B) ⋈ S(B, C) The map function Input is key-value pairs (X, t) – X is either R or S – t is a tuple of the relation named by X. Output is a single pair (b, (R, a)) or (b, (S, c)), depending on X – b is the B-value of t – a is the A-value of t (if X = R) – c is the C-value of t (if X = S). The reduce function Input is a pair (b, [(R, a), (S, c), …]). It extracts all the A-values associated with R and all the C-values associated with S. These are paired in all possible ways, with b in the middle, to form a tuple of the result.
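The join scheme above can be sketched in Python. This is an illustration, not the paper's code: relation names are represented as the strings "R" and "S", and tuples as pairs.

```python
def map_join(relation_name, t):
    """Map function for joining R(A, B) with S(B, C):
    tag each tuple with its relation and key it by its B-value."""
    if relation_name == "R":
        a, b = t
        yield (b, ("R", a))   # b is the B-value, a the A-value of t
    else:                     # relation_name == "S"
        b, c = t
        yield (b, ("S", c))   # b is the B-value, c the C-value of t

def reduce_join(b, tagged_values):
    """Reduce function: pair every A-value from R with every
    C-value from S, with b in the middle, to form result tuples."""
    a_values = [v for tag, v in tagged_values if tag == "R"]
    c_values = [v for tag, v in tagged_values if tag == "S"]
    return [(a, b, c) for a in a_values for c in c_values]
```

Note the cross-product in the reduce step: if b occurs m times in R and n times in S, the reducer for b emits m x n joined tuples, exactly as the text describes.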
Reading Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters http://labs.google.com/papers/mapreduce.html Hadoop (Apache) – Open Source implementation of MapReduce http://hadoop.apache.org/core