Lecture 4. MapReduce Instructor: Weidong Shi (Larry), PhD


1 Lecture 4. MapReduce Instructor: Weidong Shi (Larry), PhD
COSC6376 Cloud Computing
Lecture 4. MapReduce
Instructor: Weidong Shi (Larry), PhD
Computer Science Department, University of Houston

2 Outline
Motivation: Why MapReduce
Programming Model
Examples
  Word Count
  Reverse Page Link

3 Reading Assignment
Summary due next Tuesday in class

4 The CAP Theorem
You can have at most two of these three properties in any shared-data system: Consistency (C), Availability (A), and Partition-resilience (P)
To scale out, you have to partition. That leaves either consistency or availability to choose from
In almost all cases, you would choose availability over consistency
[Diagram: a triangle with vertices C, A, and P. Claim: every distributed system is on one side of the triangle.]

5 Motivation, What is MapReduce

6 Dealing with Lots of Data
Example: 20+ billion web pages x 20 KB each = 400+ TB
~400 hard drives (1 TB each) just to store the web
Even more to do something with the data
One computer can read ~50 MB/sec from disk
About three months just to read the web
Solution: spread the work over many machines
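The slide's arithmetic can be checked in a few lines, using only the figures stated above (20 billion pages, 20 KB/page, 1 TB drives, ~50 MB/s sequential reads):

```python
# Back-of-the-envelope check of the slide's numbers.
pages = 20e9
page_size = 20e3             # 20 KB in bytes
total = pages * page_size    # bytes needed to store the web

drives = total / 1e12        # number of 1 TB drives
read_seconds = total / 50e6  # one disk reading at ~50 MB/s
read_months = read_seconds / (30 * 24 * 3600)

print(drives)       # 400 drives
print(read_months)  # ~3.1 months to read it all
```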

7 Commodity Clusters
Standard architecture emerging:
Cluster of commodity Linux nodes
Gigabit Ethernet interconnect
How to organize computations on this architecture?
Mask issues such as hardware failure

8 Cluster Architecture
2-10 Gbps backbone between racks; 1 Gbps between any pair of nodes in a rack
[Diagram: racks of nodes (each with CPU, memory, and disk), one switch per rack, rack switches connected by a backbone switch]
Each rack contains multiple nodes

9 Motivation: Large-Scale Data Processing
Many tasks consist of processing lots of data to produce lots of other data
Want to use 1000s of CPUs, but don't want the hassle of managing them
MapReduce provides:
User-defined functions
Automatic parallelization and distribution
Fault-tolerance
I/O scheduling
Status and monitoring

10 Stable Storage
First-order problem: if nodes can fail, how can we store data persistently?
Answer: a distributed file system
Provides a global file namespace
Examples: Google GFS; Hadoop HDFS; Kosmix KFS
Typical usage pattern:
Huge files (100s of GB to TB)
Data is rarely updated in place
Reads and appends are common

11 Distributed File System
Chunk servers
File is split into contiguous chunks, typically 16-64 MB each
Each chunk is replicated (usually 2x or 3x)
Try to keep replicas in different racks
Master node (a.k.a. the Name Node in HDFS)
Stores metadata
May itself be replicated
Client library for file access
Talks to the master to find chunk servers
Connects directly to chunk servers to access data

12 MapReduce Programming Model

13 What is Map/Reduce
A programming model borrowed from LISP (and other functional languages)
Many problems can be phrased this way
Easy to distribute across nodes
Imagine 10,000 machines ready to help you compute anything you could cast as a MapReduce problem!
This is the abstraction Google is famous for authoring
It hides LOTS of the difficulty of writing parallel code!
The system takes care of load balancing, dead machines, etc.
Nice retry/failure semantics

14 Basic Ideas
[Diagram: source data is split across map tasks Map1..Map m; each emits (key, value) pairs ("indexing"); all pairs sharing a key (key1..key n) are routed to the same reduce task, Reduce1..Reduce n]

15 Programming Concept
Map: perform a function on individual values in a data set to create a new list of values
Example: square x = x * x
map square [1,2,3,4,5] returns [1,4,9,16,25]
Reduce: combine values in a data set to create a single new value
Example: sum: for each elem in arr, total += elem
reduce sum [1,2,3,4,5] returns 15 (the sum of the elements)
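The two primitives above exist directly in Python's standard library; a minimal demonstration of the slide's square and sum examples:

```python
from functools import reduce

# `map` applies a function element-wise to produce a new list;
# `reduce` folds the whole list down to a single value.
def square(x):
    return x * x

squares = list(map(square, [1, 2, 3, 4, 5]))                # [1, 4, 9, 16, 25]
total = reduce(lambda acc, x: acc + x, [1, 2, 3, 4, 5], 0)  # 15
print(squares, total)
```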

16 MapReduce Programming Model
Input & Output: each a set of key/value pairs
Programmer specifies two functions:
map(in_key, in_value) -> list(out_key, intermediate_value)
  Processes an input key/value pair
  Produces a set of intermediate pairs
reduce(out_key, list(intermediate_value)) -> list(out_value)
  Combines all intermediate values for a particular key
  Produces a set of merged output values (usually just one)
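The whole model can be sketched as a single-process driver: the programmer supplies the two functions, and the "framework" handles the grouping by key. This is an illustrative toy (the name `map_reduce` and its signature are assumptions, not any real framework's API):

```python
from collections import defaultdict

def map_reduce(inputs, map_fn, reduce_fn):
    """Single-process sketch: apply map_fn to each (key, value) input,
    group intermediate pairs by key, then apply reduce_fn per key."""
    groups = defaultdict(list)
    for in_key, in_value in inputs:
        for out_key, inter_value in map_fn(in_key, in_value):
            groups[out_key].append(inter_value)  # the group-by-key step
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}

# Tiny usage example: count words in one "line" of input.
out = map_reduce(
    [("line1", "a b a")],
    map_fn=lambda k, v: [(w, 1) for w in v.split()],
    reduce_fn=lambda k, vs: sum(vs),
)
print(out)  # {'a': 2, 'b': 1}
```

In a real deployment the map calls, the shuffle, and the reduce calls each run distributed across many machines; only the two user-defined functions stay the same.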

17 Examples

18 Warm up: Word Count
We have a large file of words, with many words in each line
Count the number of times each distinct word appears in the file(s)

19 Word Count using MapReduce
map(key=line, value=contents):
  for each word w in value:
    EmitIntermediate(w, 1)

reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each v in values:
    result += v
  Emit(key, result)

20 Word Count, Illustrated
map(key=line, val=contents): for each word w in contents, emit (w, "1")
reduce(key=word, values=uniq_counts): sum all "1"s in the values list, emit (word, sum)

Input: see bob run / see spot throw
Map output: (see,1) (bob,1) (run,1) (see,1) (spot,1) (throw,1)
Reduce output: (bob,1) (run,1) (see,2) (spot,1) (throw,1)

21 MapReduce WordCount Java code
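The Java listing from this slide did not survive in the transcript. As a stand-in, here is a self-contained Python sketch mirroring the canonical Hadoop WordCount structure (a mapper emitting (word, 1), a reducer summing per word, and a sort-based group-by between them); the function names are illustrative, not Hadoop's API:

```python
from itertools import groupby

def mapper(line):
    # Emit (word, 1) for every word, like the slide's map().
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    # Sum the 1s for one word, like the slide's reduce().
    yield word, sum(counts)

def word_count(lines):
    # Sorting the intermediate pairs plays the role of the shuffle:
    # it brings all pairs with the same key together.
    intermediate = sorted(kv for line in lines for kv in mapper(line))
    result = {}
    for word, group in groupby(intermediate, key=lambda kv: kv[0]):
        for w, total in reducer(word, (count for _, count in group)):
            result[w] = total
    return result

print(word_count(["see bob run", "see spot throw"]))
# {'bob': 1, 'run': 1, 'see': 2, 'spot': 1, 'throw': 1}
```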

22 Map Function
How it works:
Input data: tuples (e.g., lines in a text file)
Apply a user-defined function to process the data by keys
Output: (key, value) tuples
The output keys are normally defined differently from the input keys
Under the hood:
The input file is split and sent to different distributed map tasks (transparently to the user)
Results are grouped by key and stored on each map node's local Linux file system

23 Reduce Function
How it works:
Group the mappers' output (key, value) tuples by key
Apply a user-defined function to process each group of tuples
Output: typically, (key, aggregate) tuples
Under the hood:
Each reduce task handles a number of keys
Each reduce task pulls the results for its assigned keys from the maps' outputs
Each reduce task generates one result file in GFS (or HDFS)

24 Summary of the Ideas
The mapper generates some kind of index for the original data
The reducer applies grouping/aggregation based on that index
Flexibility: developers are free to generate all kinds of different indices from the original data
Thus, many different types of jobs can be done with this simple framework

25 Example: Count URL Access Frequency
Work on the log of web page requests: (session ID, URL) pairs
Map
Input: (session ID, URL)
Output: (URL, 1)
Reduce
Input: (URL, list(1))
Output: (URL, count)
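A minimal sketch of this job, with the two phases run in-process on a toy request log (the log entries and URLs below are made up for illustration):

```python
from collections import Counter

# Toy request log of (session_id, url) pairs.
log = [("s1", "/a"), ("s2", "/b"), ("s1", "/a"), ("s3", "/a")]

# Map phase: drop the session ID, emit (url, 1) per request.
mapped = [(url, 1) for _, url in log]

# Reduce phase: sum the 1s per URL (grouping collapsed into a Counter).
counts = Counter()
for url, one in mapped:
    counts[url] += one

print(dict(counts))  # {'/a': 3, '/b': 1}
```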

26 Example: Reverse Web-link Graph
Each source page has links to target pages; find (target, list(sources))
Map
Input: (src URL, page content)
Output: (tgt URL, src URL) for each target linked from the page
Reduce
Output: (tgt URL, list(src URL))
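The same pattern, sketched in-process. For brevity the toy input below stores each page's outgoing links directly as a list, standing in for link extraction from raw page content; the page names are made up:

```python
from collections import defaultdict

# Toy web graph: source page -> its outgoing links.
pages = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
}

# Map phase: invert each edge, emitting (target, source).
mapped = [(tgt, src) for src, tgts in pages.items() for tgt in tgts]

# Group-by-key + reduce phase: collect the source list per target.
reverse = defaultdict(list)
for tgt, src in mapped:
    reverse[tgt].append(src)

print(dict(reverse))
# {'b.html': ['a.html'], 'c.html': ['a.html', 'b.html']}
```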

