
COMP7330/7336 Advanced Parallel and Distributed Computing MapReduce - Introduction Dr. Xiao Qin Auburn University




1 COMP7330/7336 Advanced Parallel and Distributed Computing MapReduce - Introduction Dr. Xiao Qin Auburn University http://www.eng.auburn.edu/~xqin xqin@auburn.edu

2 Review: Explicit Threads (Cons) versus Directive-Based Programming (Pros) Directives layered on top of threads facilitate a variety of thread-related tasks. The programmer is relieved of initializing attribute objects, setting up arguments to threads, partitioning iteration spaces, etc.

3 Review: Explicit Threads (Pros) versus Directive-Based Programming (Cons) An artifact of explicit threading is that data exchange is more apparent. This helps in alleviating some of the overheads from data movement, false sharing, and contention. Explicit threading also provides a richer API in the form of condition waits, locks of different types, and increased flexibility for building composite synchronization operations. Finally, since explicit threading is used more widely than OpenMP, tools and support for Pthreads programs are easier to find.

4 Before MapReduce… Large-scale data processing was difficult! (Why?) – Managing hundreds or thousands of processors – Managing parallelization and distribution – I/O scheduling – Status and monitoring – Fault/crash tolerance MapReduce provides all of these, easily! – Introduction based on Google's paper: Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters." Communications of the ACM 51.1 (2008): 107-113. (see also OSDI'04)

5 MapReduce Overview What is it? – Programming model used by Google – A combination of the Map and Reduce models with an associated implementation – Used for processing and generating large data sets How does it solve our previously mentioned problems? – MapReduce is highly scalable and can be used across many computers in a cluster. – Many small machines can be used to process jobs that normally could not be processed by a single large machine.

6 Big Data Applications YouTube receives 48 hours of video every minute. Facebook receives 35,000 "Likes" every second. 34,000 people "tweet" every minute. 100 TB of data is uploaded to Facebook every day.

7 Map-Reduce Framework (diagram of the Map-Reduce framework)

8 MapReduce Usage Large-Scale Data Processing – Can make use of 1000s of CPUs – Avoid the hassle of managing parallelization Provide a complete run-time system – Automatic parallelization & distribution – Fault tolerance – I/O scheduling – Monitoring & status updates (Figure: User Growth at Google, 2004)

9 MapReduce Basic Ingredients Programmers specify two functions: map (k, v) → <k', v'>* reduce (k', v') → <k', v'>* – All values with the same key are sent to the same reducer The execution framework handles everything else…
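The "everything else" the framework handles includes grouping every intermediate value under its key before the reduce phase. A minimal in-process sketch of that grouping step (the function name `shuffle` and the sample pairs are illustrative, not part of any real framework API):

```python
from collections import defaultdict

def shuffle(mapped_pairs):
    # Group every value under its key, as the framework's
    # shuffle-and-sort stage does between map and reduce.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return dict(groups)

pairs = [("a", 1), ("b", 2), ("a", 5), ("c", 3)]
print(shuffle(pairs))  # {'a': [1, 5], 'b': [2], 'c': [3]}
```

Because all values for key "a" land in one list, a single reducer can process them together.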

10 (Diagram) Mappers process input pairs (k1,v1)…(k6,v6) and emit intermediate pairs: a→1, b→2 | c→3, c→6 | a→5, c→2 | b→7, c→8. Shuffle and Sort then aggregates values by key: a→[1,5], b→[2,7], c→[2,3,6,8]. Reducers consume these groups and produce output pairs (r1,s1), (r2,s2), (r3,s3).

11 MapReduce – Two Phases Programmers specify two functions: map (k, v) → <k', v'>* reduce (k', v') → <k', v'>* – All values with the same key are reduced together The execution framework handles everything else… Not quite… usually, programmers also specify: combine (k', v') → <k', v'>* – Mini-reducers that run in memory after the map phase – Used as an optimization to reduce network traffic partition (k', number of partitions) → partition for k' – Often a simple hash of the key, e.g., hash(k') mod n – Divides up the key space for parallel reduce operations
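A minimal sketch of these two extra ingredients. The byte-sum hash stands in for a real hash function, and the function names are illustrative, not a real framework's API:

```python
def partition(key, num_partitions):
    # Hash partitioner: a deterministic hash of the key modulo the
    # number of reducers (the slide's hash(k') mod n). Summing the
    # key's bytes is a stand-in for a real hash function.
    return sum(key.encode()) % num_partitions

def combine(key, values):
    # Mini-reducer run on a mapper's local output before the shuffle.
    # Legal for word count because addition is associative and
    # commutative, so pre-summing does not change the final result.
    yield key, sum(values)

print(partition("see", 4))          # same reducer for "see" every time
print(list(combine("c", [3, 6])))   # [('c', 9)]
```

Combining c→3 and c→6 into a single c→9 on the mapper means one intermediate pair crosses the network instead of two.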

12 (Diagram) The same pipeline with the two extra steps: mappers emit a→1, b→2 | c→3, c→6 | a→5, c→2 | b→7, c→8; local combiners merge values per key (the mapper that emitted c→3 and c→6 sends just c→9); partitioners assign each key to a reducer; Shuffle and Sort then delivers a→[1,5], b→[2,7], c→[2,9,8] to the reducers, which emit (r1,s1), (r2,s2), (r3,s3).

13 Map Abstraction Inputs a key/value pair – Key is a reference to the input value – Value is the data set on which to operate Evaluation – Function defined by user – Applies to every input value Might need to parse input Produces a new list of key/value pairs – Can be a different type from the input pair
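A minimal sketch of this abstraction for word count (the name `map_fn` and the sample document are illustrative):

```python
def map_fn(key, value):
    # key: a reference to the input (e.g., a document URL);
    # value: the contents to operate on. Parses the input and
    # emits a new list of key/value pairs whose type differs
    # from the input pair's ((url, text) in, (word, count) out).
    for word in value.split():
        yield word, 1

print(list(map_fn("doc1", "see bob throw")))
# [('see', 1), ('bob', 1), ('throw', 1)]
```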

14 Reduce Abstraction Typically a function that: – Starts with a large number of key/value pairs One key/value for each word in all files being grepped (including multiple entries for the same word) – Ends with very few key/value pairs One key/value for each unique word across all the files, with the number of instances summed into this entry Work is broken up so that a given worker receives input with the same key.
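The matching reduce side of the sketch (the name `reduce_fn` is illustrative):

```python
def reduce_fn(key, values):
    # The framework guarantees this worker sees every count for
    # a single key, so the many (word, 1) pairs collapse into one
    # (word, total) pair.
    yield key, sum(values)

print(list(reduce_fn("see", [1, 1])))  # [('see', 2)]
```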

15 Count Words in Docs Input consists of (url, contents) pairs map(key=url, val=contents): – For each word w in contents, emit (w, "1") reduce(key=word, values=uniq_counts): – Sum all "1"s in values list – Emit result (word, sum)

16 Word Count: Illustrated map(key=url, val=contents): for each word w in contents, emit (w, "1") reduce(key=word, values=uniq_counts): sum all "1"s in values list; emit result (word, sum) Input documents: "see bob throw" and "see spot run" Map output: (see,1) (bob,1) (throw,1) and (see,1) (spot,1) (run,1) Reduce output: (bob,1) (run,1) (see,2) (spot,1) (throw,1)
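Putting the pieces together, the whole word-count example can be run end to end with a tiny in-process driver. This is a single-machine sketch of the model, not a distributed implementation; the names `map_fn`, `reduce_fn`, and `map_reduce` are illustrative:

```python
from collections import defaultdict

def map_fn(key, value):
    # Emit (word, 1) for each word in the document contents.
    for word in value.split():
        yield word, 1

def reduce_fn(key, values):
    # Sum all counts collected for one word.
    yield key, sum(values)

def map_reduce(inputs):
    # Map phase over every (url, contents) pair.
    mapped = [pair for k, v in inputs for pair in map_fn(k, v)]
    # Shuffle and sort: aggregate values by key.
    groups = defaultdict(list)
    for k, v in mapped:
        groups[k].append(v)
    # Reduce phase: one reduce call per unique key.
    return dict(pair for k in sorted(groups) for pair in reduce_fn(k, groups[k]))

docs = [("doc1", "see bob throw"), ("doc2", "see spot run")]
print(map_reduce(docs))
# {'bob': 1, 'run': 1, 'see': 2, 'spot': 1, 'throw': 1}
```

The output matches the slide's illustration: "see" appears in both documents, so its two 1s are summed to 2, while every other word keeps a count of 1.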

17 Big Data: Solution "Googled" MapReduce! – Divide and conquer. – Google File System (GFS) to store data. Apache Hadoop – Framework for running applications on large clusters of commodity hardware. – Storage: HDFS – Processing: MapReduce

18 Hadoop in Data Centers Hadoop is – Economical – Easy to use – Portable – Reliable The infrastructure it needs lives in data centers. Facebook's Hadoop cluster has 30 PB of storage. Yahoo!, Amazon, and Google all run Hadoop in their data centers.

19 Hadoop Architecture Distributed Storage (HDFS) Distributed Processing (MapReduce)

20 Summary Map-Reduce framework Map abstraction Reduce abstraction An example: Word Count

