COSC6376 Cloud Computing
Lecture 4. MapReduce
Instructor: Weidong Shi (Larry), PhD
Computer Science Department, University of Houston


Outline
- Motivation: why MapReduce?
- Programming model
- Examples: word count, reverse page links

Reading Assignment
Summary due next Tuesday in class.

The CAP Theorem
You can have at most two of these three properties in any shared-data system: Consistency, Availability, and Partition-resilience.
To scale out, you have to partition. That leaves either consistency or availability to choose from, and in almost all cases you would choose availability over consistency.
[Figure: the CAP triangle, with Consistency, Availability, and Partition-resilience at its corners]
Claim: every distributed system is on one side of the triangle.

Motivation: What is MapReduce?

Dealing with Lots of Data
- Example: 20+ billion web pages x 20KB = 400+ terabytes
- ~400 hard drives (1TB) just to store the web; even more to do something with the data
- One computer can read ~50MB/sec from disk: three months just to read the web
- Solution: spread the work over many machines
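Checking the read-time claim: 400 TB = 4x10^8 MB; at ~50 MB/sec, that is 8x10^6 seconds, or about 93 days, i.e., roughly three months on a single machine.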

Commodity Clusters
Standard architecture emerging:
- Cluster of commodity Linux nodes
- Gigabit Ethernet interconnect
How do we organize computations on this architecture, while masking issues such as hardware failure?

Cluster Architecture
- 2-10 Gbps backbone between racks
- 1 Gbps between any pair of nodes in a rack
- Each rack contains 16-64 nodes
[Figure: racks of nodes (each with CPU, memory, and disk) connected through per-rack switches to a backbone switch]

Motivation: Large-Scale Data Processing
Many tasks consist of processing lots of data to produce lots of other data.
- Want to use 1000s of CPUs, but don't want the hassle of managing things
MapReduce provides:
- User-defined functions
- Automatic parallelization and distribution
- Fault tolerance
- I/O scheduling
- Status and monitoring

Stable Storage
First-order problem: if nodes can fail, how can we store data persistently?
Answer: a distributed file system
- Provides a global file namespace
- Examples: Google GFS, Hadoop HDFS, Kosmix KFS
Typical usage pattern:
- Huge files (100s of GB to TB)
- Data is rarely updated in place
- Reads and appends are common

Distributed File System
Chunk servers
- A file is split into contiguous chunks, typically 16-64MB each
- Each chunk is replicated (usually 2x or 3x), with replicas kept in different racks when possible
Master node (a.k.a. the Name Node in HDFS)
- Stores metadata
- Might be replicated
Client library for file access
- Talks to the master to find the chunk servers
- Connects directly to chunk servers to access data

MapReduce Programming Model

What is Map/Reduce?
- A programming model from LISP (and other functional languages)
- Many problems can be phrased this way, and such problems are easy to distribute across nodes
- Imagine 10,000 machines ready to help you compute anything you could cast as a MapReduce problem!
- This is the abstraction Google is famous for authoring
- It hides LOTS of the difficulty of writing parallel code: the system takes care of load balancing, dead machines, etc., with nice retry/failure semantics

Basic Ideas
[Figure: source data is split across map tasks Map_1 ... Map_m; each map emits (key, value) pairs; the pairs are grouped by key ("indexing"), and each key, key_1 ... key_n, is routed to one of the reduce tasks Reduce_1 ... Reduce_n]

Programming Concept
Map: perform a function on individual values in a data set to create a new list of values.
- Example: square x = x * x; map square [1,2,3,4,5] returns [1,4,9,16,25]
Reduce: combine the values in a data set to create a new value.
- Example: summing each element into a running total; reduce (+) [1,2,3,4,5] returns 15 (the sum of the elements)
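The same two operations are built into most languages today; a quick illustration with Java streams (plain Java for intuition, not Hadoop code):

import java.util.List;
import java.util.stream.Collectors;

public class MapReduceConcept {
  public static void main(String[] args) {
    List<Integer> nums = List.of(1, 2, 3, 4, 5);
    // map: apply the square function to each element independently
    List<Integer> squares =
        nums.stream().map(x -> x * x).collect(Collectors.toList());
    System.out.println(squares);                      // [1, 4, 9, 16, 25]
    // reduce: fold all elements into one value with +
    int sum = nums.stream().reduce(0, Integer::sum);
    System.out.println(sum);                          // 15
  }
}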

MapReduce Programming Model
Input & output: each a set of key/value pairs.
The programmer specifies two functions:
- map (in_key, in_value) -> list(out_key, intermediate_value)
  Processes an input key/value pair and produces a set of intermediate pairs
- reduce (out_key, list(intermediate_value)) -> list(out_value)
  Combines all intermediate values for a particular key and produces a set of merged output values (usually just one)
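Rendered as generic Java interfaces, the two signatures look like this. This is a sketch of the abstraction only; the names Pair, MapFn, and ReduceFn are made up here for illustration and are not Hadoop's API (requires Java 16+ for records):

record Pair<K, V>(K key, V value) {}

interface MapFn<InK, InV, OutK, MidV> {
  // map(in_key, in_value) -> list(out_key, intermediate_value)
  Iterable<Pair<OutK, MidV>> map(InK key, InV value);
}

interface ReduceFn<OutK, MidV, OutV> {
  // reduce(out_key, list(intermediate_value)) -> list(out_value)
  Iterable<OutV> reduce(OutK key, Iterable<MidV> values);
}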

Examples

Warm-up: Word Count
We have a large file of words, with many words per line.
Count the number of times each distinct word appears in the file(s).

Word Count using MapReduce
map(key=line, value=contents):
  for each word w in value:
    EmitIntermediate(w, 1)
reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each v in values:
    result += v
  Emit(key, result)

Word Count, Illustrated
map(key=line, val=contents): for each word w in contents, emit (w, "1")
reduce(key=word, values=uniq_counts): sum all "1"s in the values list, emit (word, sum)
Input lines: "see bob run" / "see spot throw"
Map output: (see, 1), (bob, 1), (run, 1), (see, 1), (spot, 1), (throw, 1)
Reduce output: (bob, 1), (run, 1), (see, 2), (spot, 1), (throw, 1)

MapReduce WordCount Java code
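The Java listing itself did not survive the transcript; the sketch below follows the standard Hadoop WordCount example (the org.apache.hadoop.mapreduce API) and mirrors the pseudocode above.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: for each word w in the line, emit the intermediate pair (w, 1)
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sum all counts received for a word and emit (word, sum)
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

It would be run in the usual way, e.g. hadoop jar wordcount.jar WordCount <input dir> <output dir>.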

Map Function
How it works:
- Input data: tuples (e.g., lines in a text file)
- Apply a user-defined function to process the data by keys
- Output: (key, value) tuples; the output keys are normally defined differently from the input keys
Under the hood:
- The data file is split and sent to different distributed maps (which the user does not see)
- Results are grouped by key and stored in the local Linux file system of the map

Reduce Function
How it works:
- Group the mappers' output (key, value) tuples by key
- Apply a user-defined function to process each group of tuples
- Output: typically, (key, aggregate) tuples
Under the hood:
- Each reduce handles a number of keys
- Each reduce pulls the results for its assigned keys from the maps' outputs
- Each reduce generates one result file in GFS (or HDFS)
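The grouping ("shuffle") step is easy to see in a toy, single-process simulation; this is for intuition only, since real Hadoop performs the shuffle across machines:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShuffleDemo {
  public static void main(String[] args) {
    // Mapper output for "see bob run" / "see spot throw" (from the earlier slide)
    List<Map.Entry<String, Integer>> mapOutput = List.of(
        Map.entry("see", 1), Map.entry("bob", 1), Map.entry("run", 1),
        Map.entry("see", 1), Map.entry("spot", 1), Map.entry("throw", 1));
    // Shuffle: group intermediate (key, value) pairs by key
    Map<String, List<Integer>> grouped = new TreeMap<>();
    for (Map.Entry<String, Integer> kv : mapOutput)
      grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
    // Reduce: sum each key's list of counts
    grouped.forEach((word, counts) ->
        System.out.println(word + " " + counts.stream().mapToInt(Integer::intValue).sum()));
    // Prints: bob 1, run 1, see 2, spot 1, throw 1
  }
}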

Summary of the Ideas
- The mappers generate a kind of index over the original data
- The reducers group/aggregate based on that index
Flexibility:
- Developers are free to generate all kinds of different indices over the original data
- Thus many different types of jobs can be expressed in this simple framework

Example: Count URL Access Frequency
Work on a log of web page requests: (session ID, URL) records.
Map: input: log records; output: (URL, 1) per request
Reduce: input: (URL, list(1)); output: (URL, count)
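A minimal mapper sketch for this job; the "sessionID URL" log-line format is an assumption, since the slides don't pin it down. It plugs into the same job skeleton and imports as the WordCount listing above, and the reducer is the same summing reducer.

// Assumes the same Hadoop imports as the WordCount example above.
public static class UrlCountMapper extends Mapper<Object, Text, Text, IntWritable> {
  private static final IntWritable one = new IntWritable(1);
  private final Text url = new Text();

  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    // Assumed log format: "sessionID URL"; adjust the field index to your logs
    String[] fields = value.toString().trim().split("\\s+");
    if (fields.length >= 2) {
      url.set(fields[1]);
      context.write(url, one);   // emit (URL, 1) for each request
    }
  }
}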

Example: Reverse Web-Link Graph
Each source page has links to target pages; for each target, find (target, list(sources)).
Map: input: (src URL, page content); output: (tgt URL, src URL) for each link on the page
Reduce: output: (tgt URL, list(src URL))
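A sketch of both functions in Hadoop style, again assuming the WordCount imports above. Here extractLinks() is a hypothetical helper (not a Hadoop API) standing in for whatever HTML link extraction is used, and the input is assumed to arrive as (src URL, page content) pairs as the slide states:

public static class ReverseLinkMapper extends Mapper<Text, Text, Text, Text> {
  @Override
  public void map(Text srcUrl, Text pageContent, Context context)
      throws IOException, InterruptedException {
    // extractLinks() is hypothetical: returns the target URLs found in the page
    for (String target : extractLinks(pageContent.toString())) {
      context.write(new Text(target), srcUrl);          // emit (target, source)
    }
  }
}

public static class ReverseLinkReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  public void reduce(Text targetUrl, Iterable<Text> sources, Context context)
      throws IOException, InterruptedException {
    StringBuilder list = new StringBuilder();
    for (Text src : sources) {                          // concatenate all sources
      if (list.length() > 0) list.append(',');
      list.append(src.toString());
    }
    context.write(targetUrl, new Text(list.toString()));  // (target, list(sources))
  }
}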