Large-scale file systems and Map-Reduce

1 Large-scale file systems and Map-Reduce
Google example: 20+ billion web pages × 20 KB each = 400+ terabytes. A single computer reading on the order of 35 MB/sec from disk would need ~4 months just to read the web, and ~1,000 hard drives to store it; it takes even more to do something useful with the data. A new standard architecture is emerging: a cluster of commodity Linux nodes with a gigabit Ethernet interconnect, replacing the single-node architecture of CPU, memory, and disk.
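A quick back-of-the-envelope check of these numbers (a sketch only; the ~35 MB/sec read rate and ~400 GB drive size are assumptions chosen to be consistent with the slide's totals):

```python
# Rough arithmetic behind the slide's estimates (all numbers approximate).
pages = 20e9                 # 20+ billion web pages
page_size = 20e3             # ~20 KB per page
total_bytes = pages * page_size            # ~4e14 bytes = 400+ TB

read_rate = 35e6             # assumed ~35 MB/sec sequential read on one disk
months = total_bytes / read_rate / (86400 * 30)
print(f"~{total_bytes / 1e12:.0f} TB; ~{months:.1f} months to read on one machine")

drive_size = 400e9           # assumed ~400 GB per drive (circa-2006 hardware)
print(f"~{total_bytes / drive_size:.0f} drives to store the web")
```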

2 Distributed File Systems
Files are very large and are read or appended to, rarely overwritten. They are divided into chunks, typically 64 MB each. Chunks are replicated at several compute nodes. A master node (possibly itself replicated) keeps track of the locations of all chunks.

3 Commodity clusters: compute nodes
Compute nodes are organized into racks. Intra-rack connections are typically gigabit speed; inter-rack connections are faster by only a small factor. Recall that chunks are replicated across nodes. Some implementations: GFS (Google File System, proprietary; in August 2006 Google had ~450,000 machines), HDFS (Hadoop Distributed File System, open source), and CloudStore (formerly the Kosmos File System, open source).

4 Problems with large-scale computing on commodity hardware
Challenges: How do you distribute computation? How can we make it easy to write distributed programs? Machines fail: a single server may stay up for 3 years (~1,000 days), so with 1,000 servers, expect to lose about one per day. With the ~1M machines Google was estimated to run in 2011, about 1,000 machines fail every day!

5 Issue: Copying data over a network takes time
Idea: bring the computation close to the data, and store files multiple times for reliability. Map-reduce addresses both problems. It is Google's computational/data-manipulation model and an elegant way to work with big data. Storage infrastructure: a distributed file system (Google: GFS; Hadoop: HDFS). Programming model: Map-Reduce.

6 Problem: If nodes fail, how do we store data persistently?
Answer: a distributed file system, which provides a global file namespace (Google GFS; Hadoop HDFS). Typical usage pattern: huge files (100s of GB to TB); data is rarely updated in place; reads and appends are common.

7 Replication
[Figure: racks of compute nodes holding replicated file chunks.]

8 Replication
3-way replication of files, with copies on different racks.

9 Map-Reduce
You write two functions, Map and Reduce. Each has a special form, explained below. The system (e.g., Hadoop) creates a large number of tasks for each function and divides the work among them in a precise way.

10 Map-Reduce Algorithms
Map tasks convert inputs to key-value pairs; "keys" are not necessarily unique. The outputs of the Map tasks are sorted by key, and each key is assigned to one Reduce task. Reduce tasks combine the values associated with each key.
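To make the pipeline concrete, here is a minimal in-memory sketch of the Map, sort/group, Reduce flow (plain Python, no Hadoop; `run_mapreduce` is an illustrative name, not a real API):

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map: each input record may emit any number of (key, value) pairs.
    pairs = [kv for record in inputs for kv in map_fn(record)]
    # Shuffle: sort by key so all values for one key become adjacent,
    # standing in for the system's sort-and-group step.
    pairs.sort(key=itemgetter(0))
    # Reduce: combine the values associated with each key.
    return [reduce_fn(key, [value for _, value in group])
            for key, group in groupby(pairs, key=itemgetter(0))]
```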

11 Simple map-reduce example: Word Count
We have a large file of words, one word per line, and want to count the number of times each distinct word appears in the file. Sample application: analyzing web server logs to find popular URLs. Different scenarios: Case 1: the entire file fits in main memory. Case 2: the file is too large for main memory, but all <word, count> pairs fit. Case 3: the file is on disk, and there are too many distinct words to fit in memory.

12 Word Count
Map task: for each word, e.g. CAT, output (CAT, 1). The total Map output is (w1,1), (w1,1), …, (w1,1), (w2,1), (w2,1), …, (w2,1), … Each pair (w,1) is hashed to bucket h(w) in [0, r-1] in a local intermediate file, where r is the number of reducers.
Master: group by key, giving (w1,[1,1,…,1]), (w2,[1,1,…,1]), …, and push each group (w,[1,1,…,1]) to reducer h(w).
Reduce task: reducer h(w) reads (w,[1,1,…,1]), aggregates it into (w, sum), and outputs (w, sum) into a common output file.
Since addition is commutative and associative, the Map task could instead have sent partial sums (w1, sum1), (w2, sum2), …; the Reduce task would then receive several partial sums per word, (wi, sumi,1), (wi, sumi,2), …, (wj, sumj,1), (wj, sumj,2), …, and output (wi, sumi), (wj, sumj), …
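A sketch of the word-count Map and Reduce functions just described, written against the toy `run_mapreduce` driver from the earlier sketch (the partial sums mentioned above would plug in the same way):

```python
def wc_map(line):
    # Emit (word, 1) for every word on the line, e.g. (CAT, 1).
    return [(word, 1) for word in line.split()]

def wc_reduce(word, counts):
    # Addition is commutative and associative, so it makes no difference
    # whether `counts` holds raw 1s or partial sums from map-side combining.
    return (word, sum(counts))

lines = ["CAT DOG CAT", "DOG"]
print(run_mapreduce(lines, wc_map, wc_reduce))   # [('CAT', 2), ('DOG', 1)]
```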

13 Partition Function
Inputs to Map tasks are created by contiguous splits of the input file. For Reduce, we need to ensure that records with the same intermediate key end up at the same worker. The system uses a default partition function, e.g., hash(key) mod R. It is sometimes useful to override it: e.g., hash(hostname(URL)) mod R ensures that URLs from the same host end up in the same output file.
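A sketch of the default and overridden partitioners described above (illustrative only; `crc32` stands in for the framework's hash, since Python's built-in `hash` is salted per process, and the hostname override is the slide's example):

```python
from urllib.parse import urlparse
from zlib import crc32

R = 16  # number of reduce tasks

def default_partition(key: str) -> int:
    # Default scheme: hash(key) mod R, spreading keys over the reducers.
    return crc32(key.encode()) % R

def url_host_partition(url: str) -> int:
    # Override: hash(hostname(URL)) mod R, so all URLs from one host
    # land at the same reducer and hence in the same output file.
    return crc32(urlparse(url).netloc.encode()) % R
```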

14 Coordination: master data structures
Task status: (idle, in-progress, completed). Idle tasks get scheduled as workers become available. When a map task completes, it sends the master the locations and sizes of its R intermediate files, one for each reducer, and the master pushes this information to the reducers. The master pings workers periodically to detect failures.
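A toy sketch of the master's bookkeeping (the task states and the per-map list of R intermediate files come from the slide; the data layout and function names are assumptions):

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    status: str = "idle"                      # idle | in-progress | completed
    worker: str | None = None                 # worker currently assigned
    # Filled in when a map task completes: location and size of each of its
    # R intermediate files, one per reducer; the master pushes these out.
    intermediate_files: list[tuple[str, int]] = field(default_factory=list)

def schedule_idle_tasks(tasks: list[Task], free_workers: list[str]) -> None:
    # Idle tasks get scheduled as workers become available.
    for task in tasks:
        if task.status == "idle" and free_workers:
            task.worker = free_workers.pop()
            task.status = "in-progress"
```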

15 Data flow
Input and final output are stored on the distributed file system. The scheduler tries to schedule map tasks "close" to the physical storage location of their input data. Intermediate results are stored on the local file system of the map and reduce workers. The output is often the input to another map-reduce task.

16 Failures
Map worker failure: map tasks completed or in progress at the worker are reset to idle (their results sit locally on the failed worker), and reduce workers are notified when a task is rescheduled on another worker. Reduce worker failure: only in-progress tasks are reset to idle. Master failure: the map-reduce job is aborted and the client is notified.

17 How many Map and Reduce tasks?
M map tasks, R reduce tasks. Rule of thumb: make M and R much larger than the number of nodes in the cluster. One map task per DFS chunk is common; this improves dynamic load balancing and speeds up recovery from worker failure. Usually R is smaller than M, because the output is spread across R files.

18 Joining by Map-Reduce
Suppose we want to compute R(A,B) ⋈ S(B,C) using k Reduce tasks, i.e., find tuples with matching B-values. R and S are each stored in a chunked file. Use a hash function h from B-values to k buckets; each bucket corresponds to one Reduce task. The Map tasks take chunks from R and S and send: tuple R(a,b) to Reduce task h(b), with key = b and value = R(a,b); tuple S(b,c) to Reduce task h(b), with key = b and value = S(b,c).
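A sketch of this join using the toy `run_mapreduce` driver from before; the partitioner plays the role of h(b), and tuples are tagged with their relation so the reducer can tell them apart:

```python
def join_map(tagged):
    # Input records are ("R", (a, b)) or ("S", (b, c)); key on the B-value.
    relation, tup = tagged
    b = tup[1] if relation == "R" else tup[0]
    return [(b, (relation, tup))]

def join_reduce(b, values):
    r_tuples = [t for rel, t in values if rel == "R"]
    s_tuples = [t for rel, t in values if rel == "S"]
    # Pair every R(a,b) with every S(b,c) that shares this B-value.
    return [(a, b, c) for (a, _) in r_tuples for (_, c) in s_tuples]

data = [("R", (1, 2)), ("R", (4, 2)), ("S", (2, 3)), ("S", (5, 6))]
print(run_mapreduce(data, join_map, join_reduce))
# [[(1, 2, 3), (4, 2, 3)], []]  -- B = 2 joins; B = 5 has no matching R tuple
```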

19 Reduce task i
Map tasks send to Reduce task i every tuple R(a,b) with h(b) = i and every tuple S(b,c) with h(b) = i. Reduce task i then outputs all (a,b,c) such that h(b) = i, (a,b) is in R, and (b,c) is in S. Key point: if R(a,b) joins with S(b,c), then both tuples are sent to Reduce task h(b), so their join (a,b,c) will be produced there and shipped to the output file.

20 Mapping tuples in joins
Mapper for R(1,2) emits (2, (R,1)); mapper for R(4,2) emits (2, (R,4)); mapper for S(2,3) emits (2, (S,3)); mapper for S(5,6) emits (5, (S,6)). The reducer for B = 2 receives (2, [(R,1), (R,4), (S,3)]); the reducer for B = 5 receives (5, [(S,6)]).

21 Output of the Reducers
The reducer for B = 2 receives (2, [(R,1), (R,4), (S,3)]) and outputs the joined tuples (1,2,3) and (4,2,3). The reducer for B = 5 receives (5, [(S,6)]) and, having no matching R tuple, outputs nothing.

22 Relational operators with map-reduce
Three-Way Join: we shall consider a simple join of three relations, the natural join R(A,B) ⋈ S(B,C) ⋈ T(C,D). One way is a cascade of two 2-way joins, each implemented by map-reduce. That is fine, unless the 2-way joins produce large intermediate relations.

23 Another 3-Way Join
Reduce processes are keyed by the hash values of an entire S(B,C) tuple, i.e., by the pair (h(b), h(c)). Choose a hash function h that maps B- and C-values to k buckets. There are k² Reduce processes, one for each (B-bucket, C-bucket) pair.

24 Job of the Reducers
Each reducer gets, for certain B-values b and C-values c: all tuples from R with B = b, all tuples from T with C = c, and the tuple S(b,c) if it exists. Thus it can create every tuple of the form (a, b, c, d) in the join.

25 Mapping for 3-Way Join
Aside: even ordinary map-reduce allows one input to map to several key-value pairs. We map each tuple S(b,c) to key (h(b), h(c)) with value (S, b, c); each tuple R(a,b) to key (h(b), y) with value (R, a, b), for all y = 1, 2, …, k; and each tuple T(c,d) to key (x, h(c)) with value (T, c, d), for all x = 1, 2, …, k.
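A sketch of the three mapping rules above (k and the hash function are placeholders; bucket indices run 0..k-1 here rather than 1..k):

```python
k = 4                      # number of buckets per attribute; k*k reducers

def h(value):
    return value % k       # stand-in for the hash to k buckets

def map_S(b, c):
    # S(b,c) goes to exactly one reducer: (h(b), h(c)).
    return [((h(b), h(c)), ("S", b, c))]

def map_R(a, b):
    # R(a,b) is replicated to every C-bucket y for its B-bucket.
    return [((h(b), y), ("R", a, b)) for y in range(k)]

def map_T(c, d):
    # T(c,d) is replicated to every B-bucket x for its C-bucket.
    return [((x, h(c)), ("T", c, d)) for x in range(k)]
```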

26 Assigning Tuples to Reducers
[Figure: a k × k grid of reducers indexed by (h(b), h(c)). The tuple S(b,c) with h(b) = 1 and h(c) = 2 goes to the single reducer (1, 2); R(a,b) with h(b) = 2 goes to all k reducers whose first coordinate is 2; T(c,d) with h(c) = 3 goes to all k reducers whose second coordinate is 3.]

27 Example: k = 2
Let the database contain R(1,1), R(1,2), R(2,1), R(2,2), S(1,1), S(1,2), S(2,1), S(2,2), T(1,1), T(1,2), T(2,1), T(2,2), and let h be the identity on {1, 2}. The mapper for R(1,1) emits (1,1,(R,1)) and (1,2,(R,1)); the mapper for S(1,2) emits (1,2,(S,1,2)); and so on. Reducer (1,1) receives R(1,1), R(2,1), S(1,1), T(1,1), T(1,2); reducer (1,2) receives R(1,1), R(2,1), S(1,2), T(2,1), T(2,2); reducer (2,1) receives R(1,2), R(2,2), S(2,1), T(1,1), T(1,2); reducer (2,2) receives R(1,2), R(2,2), S(2,2), T(2,1), T(2,2). Reducer (1,1), for example, outputs R(1,1) ⨝ S(1,1) ⨝ T(1,1), R(1,1) ⨝ S(1,1) ⨝ T(1,2), R(2,1) ⨝ S(1,1) ⨝ T(1,1), and R(2,1) ⨝ S(1,1) ⨝ T(1,2).

