Lecture 4. MapReduce Instructor: Weidong Shi (Larry), PhD


1 Lecture 4. MapReduce Instructor: Weidong Shi (Larry), PhD
COSC6376 Cloud Computing
Lecture 4. MapReduce
Instructor: Weidong Shi (Larry), PhD
Computer Science Department, University of Houston

2 Outline
Motivation: Why MapReduce
Programming Model
Examples
  Word Count
  Reverse Page Link

3 Reading Assignment
Summary due next Tuesday in class

4 The CAP Theorem
You can have at most two of these three properties in any shared-data system: Consistency (C), Availability (A), and Partition-resilience (P)
To scale out, you have to partition. That leaves either consistency or availability to choose from
In almost all cases, you would choose availability over consistency
[Diagram: a triangle with vertices C, A, and P. Claim: every distributed system is on one side of the triangle.]

5 Motivation, What is MapReduce

6 Dealing with Lots of Data
Example: 20+ billion web pages x 20 KB each = 400+ TB
~400 hard drives (1 TB each) just to store the web
Even more to do something with the data
One computer can read ~50 MB/sec from disk
About three months just to read the web
Solution: spread the work over many machines
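The slide's arithmetic can be checked in a few lines, using only the figures stated above (20 billion pages, 20 KB/page, 1 TB drives, ~50 MB/s sequential reads):

```python
# Back-of-the-envelope check of the slide's numbers.
pages = 20e9
page_size = 20e3             # 20 KB in bytes
total = pages * page_size    # bytes needed to store the web

drives = total / 1e12        # number of 1 TB drives
read_seconds = total / 50e6  # one disk reading at ~50 MB/s
read_months = read_seconds / (30 * 24 * 3600)

print(drives)       # 400 drives
print(read_months)  # ~3.1 months to read it all
```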

7 Commodity Clusters
Standard architecture emerging:
Cluster of commodity Linux nodes
Gigabit Ethernet interconnect
How to organize computations on this architecture?
Mask issues such as hardware failure

8 Cluster Architecture
2-10 Gbps backbone between racks; 1 Gbps between any pair of nodes in a rack
[Diagram: racks of nodes (each with CPU, memory, and disk), one switch per rack, rack switches connected by a backbone switch]
Each rack contains multiple nodes

9 Motivation: Large-Scale Data Processing
Many tasks consist of processing lots of data to produce lots of other data
Want to use 1000s of CPUs, but don't want the hassle of managing them
MapReduce provides:
User-defined functions
Automatic parallelization and distribution
Fault-tolerance
I/O scheduling
Status and monitoring

10 Stable Storage
First-order problem: if nodes can fail, how can we store data persistently?
Answer: a distributed file system
Provides a global file namespace
Examples: Google GFS; Hadoop HDFS; Kosmix KFS
Typical usage pattern:
Huge files (100s of GB to TB)
Data is rarely updated in place
Reads and appends are common

11 Distributed File System
Chunk servers
File is split into contiguous chunks, typically 16-64 MB each
Each chunk is replicated (usually 2x or 3x)
Try to keep replicas in different racks
Master node (a.k.a. the Name Node in HDFS)
Stores metadata
May itself be replicated
Client library for file access
Talks to the master to find chunk servers
Connects directly to chunk servers to access data

12 MapReduce Programming Model

13 What is Map/Reduce
A programming model borrowed from LISP (and other functional languages)
Many problems can be phrased this way
Easy to distribute across nodes
Imagine 10,000 machines ready to help you compute anything you could cast as a MapReduce problem!
This is the abstraction Google is famous for authoring
It hides LOTS of the difficulty of writing parallel code!
The system takes care of load balancing, dead machines, etc.
Nice retry/failure semantics

14 Basic Ideas
[Diagram: source data is split across map tasks Map1..Map m; each emits (key, value) pairs ("indexing"); all pairs sharing a key (key1..key n) are routed to the same reduce task, Reduce1..Reduce n]

15 Programming Concept
Map: perform a function on individual values in a data set to create a new list of values
Example: square x = x * x
map square [1,2,3,4,5] returns [1,4,9,16,25]
Reduce: combine values in a data set to create a single new value
Example: sum: for each elem in arr, total += elem
reduce sum [1,2,3,4,5] returns 15 (the sum of the elements)
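The two primitives above exist directly in Python's standard library; a minimal demonstration of the slide's square and sum examples:

```python
from functools import reduce

# `map` applies a function element-wise to produce a new list;
# `reduce` folds the whole list down to a single value.
def square(x):
    return x * x

squares = list(map(square, [1, 2, 3, 4, 5]))                # [1, 4, 9, 16, 25]
total = reduce(lambda acc, x: acc + x, [1, 2, 3, 4, 5], 0)  # 15
print(squares, total)
```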

16 MapReduce Programming Model
Input & Output: each a set of key/value pairs
Programmer specifies two functions:
map(in_key, in_value) -> list(out_key, intermediate_value)
  Processes an input key/value pair
  Produces a set of intermediate pairs
reduce(out_key, list(intermediate_value)) -> list(out_value)
  Combines all intermediate values for a particular key
  Produces a set of merged output values (usually just one)
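The whole model can be sketched as a single-process driver: the programmer supplies the two functions, and the "framework" handles the grouping by key. This is an illustrative toy (the name `map_reduce` and its signature are assumptions, not any real framework's API):

```python
from collections import defaultdict

def map_reduce(inputs, map_fn, reduce_fn):
    """Single-process sketch: apply map_fn to each (key, value) input,
    group intermediate pairs by key, then apply reduce_fn per key."""
    groups = defaultdict(list)
    for in_key, in_value in inputs:
        for out_key, inter_value in map_fn(in_key, in_value):
            groups[out_key].append(inter_value)  # the group-by-key step
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}

# Tiny usage example: count words in one "line" of input.
out = map_reduce(
    [("line1", "a b a")],
    map_fn=lambda k, v: [(w, 1) for w in v.split()],
    reduce_fn=lambda k, vs: sum(vs),
)
print(out)  # {'a': 2, 'b': 1}
```

In a real deployment the map calls, the shuffle, and the reduce calls each run distributed across many machines; only the two user-defined functions stay the same.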

17 Examples

18 Warm up: Word Count
We have a large file of words, with many words in each line
Count the number of times each distinct word appears in the file(s)

19 Word Count using MapReduce
map(key=line, value=contents):
  for each word w in value:
    EmitIntermediate(w, 1)

reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each v in values:
    result += v
  Emit(key, result)

20 Word Count, Illustrated
map(key=line, val=contents): for each word w in contents, emit (w, "1")
reduce(key=word, values=uniq_counts): sum all "1"s in the values list, emit (word, sum)

Input: see bob run / see spot throw
Map output: (see,1) (bob,1) (run,1) (see,1) (spot,1) (throw,1)
Reduce output: (bob,1) (run,1) (see,2) (spot,1) (throw,1)

21 MapReduce WordCount Java code
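The Java listing from this slide did not survive in the transcript. As a stand-in, here is a self-contained Python sketch mirroring the canonical Hadoop WordCount structure (a mapper emitting (word, 1), a reducer summing per word, and a sort-based group-by between them); the function names are illustrative, not Hadoop's API:

```python
from itertools import groupby

def mapper(line):
    # Emit (word, 1) for every word, like the slide's map().
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    # Sum the 1s for one word, like the slide's reduce().
    yield word, sum(counts)

def word_count(lines):
    # Sorting the intermediate pairs plays the role of the shuffle:
    # it brings all pairs with the same key together.
    intermediate = sorted(kv for line in lines for kv in mapper(line))
    result = {}
    for word, group in groupby(intermediate, key=lambda kv: kv[0]):
        for w, total in reducer(word, (count for _, count in group)):
            result[w] = total
    return result

print(word_count(["see bob run", "see spot throw"]))
# {'bob': 1, 'run': 1, 'see': 2, 'spot': 1, 'throw': 1}
```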

22 Map Function
How it works:
Input data: tuples (e.g., lines in a text file)
Apply a user-defined function to process the data by keys
Output: (key, value) tuples
The output keys are normally defined differently from the input keys
Under the hood:
The input file is split and sent to different distributed map tasks (transparently to the user)
Results are grouped by key and stored on each map node's local Linux file system

23 Reduce Function
How it works:
Group the mappers' output (key, value) tuples by key
Apply a user-defined function to process each group of tuples
Output: typically, (key, aggregate) tuples
Under the hood:
Each reduce task handles a number of keys
Each reduce task pulls the results for its assigned keys from the maps' outputs
Each reduce task generates one result file in GFS (or HDFS)

24 Summary of the Ideas
The mapper generates some kind of index for the original data
The reducer applies grouping/aggregation based on that index
Flexibility: developers are free to generate all kinds of different indices from the original data
Thus, many different types of jobs can be done with this simple framework

25 Example: Count URL Access Frequency
Work on the log of web page requests: (session ID, URL) pairs
Map
Input: (session ID, URL)
Output: (URL, 1)
Reduce
Input: (URL, list(1))
Output: (URL, count)
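A minimal sketch of this job, with the two phases run in-process on a toy request log (the log entries and URLs below are made up for illustration):

```python
from collections import Counter

# Toy request log of (session_id, url) pairs.
log = [("s1", "/a"), ("s2", "/b"), ("s1", "/a"), ("s3", "/a")]

# Map phase: drop the session ID, emit (url, 1) per request.
mapped = [(url, 1) for _, url in log]

# Reduce phase: sum the 1s per URL (grouping collapsed into a Counter).
counts = Counter()
for url, one in mapped:
    counts[url] += one

print(dict(counts))  # {'/a': 3, '/b': 1}
```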

26 Example: Reverse Web-link Graph
Each source page has links to target pages; find (target, list(sources))
Map
Input: (src URL, page content)
Output: (tgt URL, src URL) for each target linked from the page
Reduce
Output: (tgt URL, list(src URL))
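The same pattern, sketched in-process. For brevity the toy input below stores each page's outgoing links directly as a list, standing in for link extraction from raw page content; the page names are made up:

```python
from collections import defaultdict

# Toy web graph: source page -> its outgoing links.
pages = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
}

# Map phase: invert each edge, emitting (target, source).
mapped = [(tgt, src) for src, tgts in pages.items() for tgt in tgts]

# Group-by-key + reduce phase: collect the source list per target.
reverse = defaultdict(list)
for tgt, src in mapped:
    reverse[tgt].append(src)

print(dict(reverse))
# {'b.html': ['a.html'], 'c.html': ['a.html', 'b.html']}
```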

