Parallel Computing: MapReduce, Examples, Parallel Efficiency, Assignment


Parallel Computing
- Parallel efficiency with p processors (defined below)
- Traditional parallel computing:
  - focuses on compute-intensive tasks
  - often ignores disk reads and writes
  - focuses on inter-processor network communication overheads
  - assumes a "shared-nothing" model
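The formula on the original slide was an image; the standard definition it refers to, with T_1 the execution time on one processor and T_p the time on p processors, is:

\[
  \varepsilon_p = \frac{T_1}{p \, T_p}
\]

An efficiency of 1 corresponds to perfect (linear) speedup.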

Parallel Tasks on Large Distributed Files
- Files are distributed in a GFS-like system
- Files are very large: many terabytes
- Reading from and writing to disk (GFS) is a significant part of the total time T
- Computation time per data item is not large
- All data can never be in memory, so appropriate algorithms are needed

MapReduce
MapReduce is both a programming model and a clustered computing system:
- A specific way of formulating a problem that yields good parallelizability, especially in the context of large distributed data
- A system that takes a MapReduce-formulated problem and executes it on a large cluster
- Hides implementation details such as hardware failures, grouping and sorting, scheduling, ...

Word-Count using MapReduce
Problem: determine the frequency of each word in a large document collection.

Map: document -> (word, count) pairs
Reduce: (word, list of counts) -> (word, total count)
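A minimal, sequential sketch of this formulation in Python (illustrative only; a real MapReduce system would distribute the map and reduce calls across a cluster and handle failures):

from collections import defaultdict

def map_fn(document):
    """Map: emit a (word, 1) pair for every word occurrence in the document."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    """Reduce: sum all counts emitted for a word."""
    return (word, sum(counts))

def word_count(documents):
    # Map phase: each document is processed independently (parallelizable).
    groups = defaultdict(list)
    for doc in documents:
        for word, count in map_fn(doc):
            groups[word].append(count)   # Group phase: collect values by key.
    # Reduce phase: one call per unique key (parallelizable across keys).
    return [reduce_fn(word, counts) for word, counts in groups.items()]

print(word_count(["the quick brown fox", "the lazy dog"]))
# [('the', 2), ('quick', 1), ('brown', 1), ('fox', 1), ('lazy', 1), ('dog', 1)]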

General MapReduce Formulation of a Problem
Map:
- Preprocesses a set of files to generate intermediate key-value pairs
- As parallelized as you want
Group:
- Partitions intermediate key-value pairs by unique key, generating a list of all associated values
Reduce:
- For each key, iterates over the value list
- Performs computation that requires context between iterations
- Parallelizable amongst different keys, but not within one key
A common question: how is MapReduce different from cloning the input data N ways and using N workers to process it with the original non-MR formulation? The difference is the Group phase: it brings together, on one worker, all values that share a key, which naive cloning cannot do. (A generic sequential sketch of the three phases follows.)
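A generic sequential simulator of this formulation, as a sketch under the assumption that fm yields (key, value) pairs and fr takes a key and its list of values, matching the fm/fr convention used in the Hadoop slide later in the deck:

from collections import defaultdict

def run_mapreduce(fm, fr, inputs):
    """Simulate the Map -> Group -> Reduce phases on one machine.

    fm(item) yields (key, value) pairs; fr(key, values) returns one result.
    """
    # Map phase: each input item is independent, so this loop is
    # trivially parallelizable across workers.
    intermediate = []
    for item in inputs:
        intermediate.extend(fm(item))

    # Group phase: partition intermediate pairs by unique key,
    # producing the list of all values associated with each key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: parallelizable across keys, but each key's value
    # list is consumed by a single reduce call.
    return [fr(key, values) for key, values in groups.items()]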

MapReduce Parallelization: Execution
(Figure: shamelessly stolen from Jeff Dean's OSDI '04 presentation, http://labs.google.com/papers/mapreduce-osdi04-slides/index.html)

MapReduce Parallelization: Pipelining
- Finely granular tasks: many more map tasks than machines
  - Better dynamic load balancing
  - Minimizes time for fault recovery
  - Can pipeline the shuffling/grouping while maps are still running
- Example: 2000 machines -> 200,000 map + 5000 reduce tasks
(Figure: shamelessly stolen from Jeff Dean's OSDI '04 presentation, http://labs.google.com/papers/mapreduce-osdi04-slides/index.html)

MR Runtime Execution Example
The following slides illustrate an example run of MapReduce on a Google cluster: a sample job from the indexing pipeline that processes ~900 GB of crawled pages.

MR Runtime (1 of 9) through (9 of 9)
(Nine figure-only slides stepping through the runtime execution; shamelessly stolen from Jeff Dean's OSDI '04 presentation, http://labs.google.com/papers/mapreduce-osdi04-slides/index.html)

Examples: MapReduce @ Facebook
Types of applications:
- Summarization
  - E.g., daily/weekly aggregations of impression/click counts
  - Complex measures of user engagement
- Ad hoc analysis
  - E.g., how many group admins, broken down by state/country
- Data mining (assembling training data)
  - E.g., user engagement as a function of user attributes
- Spam detection
  - Anomalous patterns
  - Application API usage patterns
- Ad optimization
- Too many to count ...

SQL Join using MapReduce
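The slide's figure is missing; a common way to express an equi-join in MapReduce is a reduce-side join, sketched below. The table and field names (Visits, Pages, user, url, time, pagerank) follow the Pig example later in the deck, but the code itself is illustrative, not from the original:

def map_visits(row):
    user, url, time = row
    yield (url, ("Visits", (user, time)))   # tag each tuple with its table

def map_pages(row):
    url, pagerank = row
    yield (url, ("Pages", pagerank))

def reduce_join(url, tagged_values):
    """Reduce: pair every Visits tuple with every Pages tuple for this url."""
    visits = [v for tag, v in tagged_values if tag == "Visits"]
    ranks = [v for tag, v in tagged_values if tag == "Pages"]
    return [(user, url, time, pr) for (user, time) in visits for pr in ranks]

Map tags each record with its source table and emits the join key; the Group phase brings all matching records for a url together, and Reduce emits the joined tuples.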

Hadoop MapReduce (Yahoo!)
- Data is stored in HDFS (Hadoop's version of GFS) or on disk
- Hadoop MR interface: the fm and fr are function objects (classes)
  - The class for fm implements the Mapper interface:
    map(WritableComparable key, Writable value, OutputCollector output, Reporter reporter)
  - The class for fr implements the Reducer interface:
    reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter)
- Hadoop takes the generated class files and manages running them

Pig Latin and Hive: MR Languages
- Pig Latin: Yahoo!
- Hive: Facebook

Example using Pig
Find sessions that end with the "best" page.

Visits:
user  url                time
Amy   www.cnn.com        8:00
Amy   www.crap.com       8:05
Amy   www.myblog.com     10:00
Amy   www.flickr.com     10:05
Fred  cnn.com/index.htm  12:00
...

Pages:
url             pagerank
www.cnn.com     0.9
www.flickr.com
www.myblog.com  0.7
www.crap.com    0.2
...

In Pig Latin:

Visits = load '/data/visits' as (user, url, time);
Visits = foreach Visits generate user, Canonicalize(url), time;
Pages = load '/data/pages' as (url, pagerank);
VP = join Visits by url, Pages by url;
UserVisits = group VP by user;
Sessions = foreach UserVisits generate flatten(FindSessions(*));
HappyEndings = filter Sessions by BestIsLast(*);
store HappyEndings into '/data/happy_endings';

Pig Latin vs. Map-Reduce
Map-Reduce welds together three primitives: process records -> create groups -> process groups:

a = FOREACH input GENERATE flatten(Map(*));
b = GROUP a BY $0;
c = FOREACH b GENERATE Reduce(*);

In Pig, these primitives are:
- explicit
- independent
- fully composable
Pig adds primitives for:
- filtering tables
- projecting tables
- combining two or more tables
which gives a more natural programming model and optimization opportunities.

Example cont.
Find users who tend to visit "good" pages. Dataflow:
- Load Visits(user, url, time)
- Transform to (user, Canonicalize(url), time)
- Load Pages(url, pagerank)
- Join on url = url
- Group by user
- Transform to (user, Average(pagerank) as avgPR)
- Filter avgPR > 0.5

Step by step, with data:

Load Visits(user, url, time):
(Amy, cnn.com, 8am)
(Amy, http://www.snails.com, 9am)
(Fred, www.snails.com/index.html, 11am)

Load Pages(url, pagerank):
(www.cnn.com, 0.9)
(www.snails.com, 0.4)

Transform to (user, Canonicalize(url), time):
(Amy, www.cnn.com, 8am)
(Amy, www.snails.com, 9am)
(Fred, www.snails.com, 11am)

Join url = url:
(Amy, www.cnn.com, 8am, 0.9)
(Amy, www.snails.com, 9am, 0.4)
(Fred, www.snails.com, 11am, 0.4)

Group by user:
(Amy, { (Amy, www.cnn.com, 8am, 0.9), (Amy, www.snails.com, 9am, 0.4) })
(Fred, { (Fred, www.snails.com, 11am, 0.4) })

Transform to (user, Average(pagerank) as avgPR):
(Amy, 0.65)
(Fred, 0.4)

Filter avgPR > 0.5:
(Amy, 0.65)

Exercise (in groups)
- Generate at least 50K random sentences of max length 140 characters from a set of 20-30 words
  - Challenge version: download at least 50K tweets using Twitter's APIs
- Find all sets of sentences that are 90% similar to each other, i.e., 90% of the words match
  - Formulate using MapReduce and implement in parallel
  - Challenge version: use Google Scholar to find an efficient algorithm for the above (it exists)
  - Challenge++: implement the above in parallel using MR (use Hadoop on AWS)
(A sketch of the data-generation step follows.)
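A minimal sketch of the sentence-generation step; the vocabulary and output file name are illustrative, and the similarity search itself is left as the exercise:

import random

VOCAB = ["map", "reduce", "cluster", "data", "key", "value", "shuffle",
         "node", "disk", "file", "task", "worker", "sort", "merge",
         "batch", "query", "index", "page", "rank", "count"]  # 20 words

def random_sentence(max_len=140):
    """Append random words until the next word would exceed max_len."""
    words = []
    while True:
        word = random.choice(VOCAB)
        # +1 per existing word accounts for the separating space.
        if sum(len(w) + 1 for w in words) + len(word) > max_len:
            break
        words.append(word)
    return " ".join(words)

with open("sentences.txt", "w") as f:
    for _ in range(50_000):
        f.write(random_sentence() + "\n")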

Parallel Efficiency of MR
Compare the execution time on a single processor with the parallel execution efficiency on P processors. The efficiency depends on the volume of intermediate data the map phase produces relative to the input; therefore keeping that ratio small is important, leading to the need for an additional intermediate 'combine' stage.
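The formulas on this slide were images; the following is a reconstruction of the standard analysis, under the assumptions that c is the cost to read or write one datum, w the cost to compute on it, D the input size, and sigma the ratio of intermediate (map output) data to input data; reduce computation and the final output write are ignored for simplicity.

Execution time on a single processor (read the data, then process it):

\[
  T_1 = D\,(c + w)
\]

On P processors, each node reads D/P, maps it, writes \sigma D/P intermediate data to disk, and the reducers read that data back:

\[
  T_P = \frac{D}{P}\,\bigl(c + w + 2\sigma c\bigr)
\]

Parallel execution efficiency on P processors:

\[
  \varepsilon_P = \frac{T_1}{P\,T_P}
               = \frac{c + w}{c + w + 2\sigma c}
               = \frac{1}{1 + \frac{2\sigma c}{c + w}}
\]

Efficiency falls as \sigma grows, which is why shrinking the intermediate data with a combine stage matters.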

Word-Count using MapReduce, with Combining
Here the mappers also perform a 'combine' step by computing the local word count within their respective documents before emitting, as sketched below.
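A sketch of this variant, assuming the same conventions as the earlier word-count example: each mapper emits one (word, local count) pair per distinct word instead of one (word, 1) pair per occurrence, shrinking the intermediate data.

from collections import Counter

def map_with_combine(document):
    """Map + local combine: emit one (word, count-in-this-document) pair."""
    for word, local_count in Counter(document.lower().split()).items():
        yield (word, local_count)

# reduce_fn is unchanged: it still sums the (now pre-aggregated) counts, and
# run_mapreduce(map_with_combine, reduce_fn, documents) from the general
# sketch above works as-is, with far fewer intermediate pairs to shuffle.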