MAP REDUCE BASICS CHAPTER 2

Basics Divide and conquer – Partition a large problem into smaller subproblems – Workers work on the subproblems in parallel Parallelism exists at every level: threads in a core, cores in a multi-core processor, multiple processors in a machine, machines in a cluster

History: CPUs Single CPU – inserted into a single CPU socket on a motherboard To scale up: – Added more CPU sockets, so multiple CPUs – Needed additional hardware to connect them to RAM and other resources – Lots of overhead, so this didn't work well

Hyper-threading Brought parallel computation to PCs – 2000: Pentium 4 – a single CPU core, but with hyper-threading (HT) – Appears as 2 logical CPUs – MIMD – But the two threads share the core's execution resources – Need OS support to take advantage of it The number of logical cores is the number of physical cores times the number of threads each core can run

Multi-core Additional cores added to the chip Single CPU socket, all cores on the same chip, so less latency – If dual core, the CPU has 2 central processing units – Can run 2 processes at the same time – Since all cores are on the same chip, no extra hardware is needed to connect them to RAM, etc., so communication is faster

vCPU – Virtual Processor A VM is assigned vCPUs: a share of a physical CPU assigned to the VM A vCPU is a series of time slots on logical processors Adding more vCPUs can increase wait time

How to increase vCPUs If each vCPU is presented to the OS as a single-core CPU in its own socket, that limits the number of vCPUs – Typically the OS restricts the number of physical CPUs (sockets), not logical CPUs Multicore vCPUs – VMware introduced virtual sockets and cores per socket, e.g. 2 virtual sockets with 4 virtual cores per socket allows 8 vCPUs Or can use hyper-threads to increase vCPUs

Multiple threads on a VM If there are 2 hyper-threads on each of 4 physical cores, the VM thinks it has 8 cores What if it wants to run all 8 threads at once?

Each thread runs at 50% utilization while each physical CPU runs at 100% Avoid creating a single VM with more vCPUs than physical cores If more are needed, split up the VM

Basics MR – an abstraction that hides system-level details from the programmer Move code to data – Spread the data across disks – A distributed file system (DFS) manages storage

Topics Functional programming MapReduce Distributed file system

Functional Programming Roots MapReduce = functional programming plus distributed processing on steroids – Not a new idea… dates back to the 50’s (or even 30’s) What is functional programming? – Computation as application of functions – Computation is evaluation of mathematical functions – Avoids state and mutable data – Emphasizes application of functions instead of changes in state

Functional Programming Roots How is it different? – Traditional notions of “data” and “instructions” are not applicable – Data flows are implicit in the program – Different orders of execution are possible – Theoretical foundation provided by the lambda calculus, a formal system for function definition – Exemplified by LISP, Scheme

Overview of Lisp Functions written in prefix notation where operators precede operands
(+ 1 2) → 3
(* 3 4) → 12
(sqrt (+ (* 3 3) (* 4 4))) → 5
(define x 3) → x
(* x 5) → 15

Functions Functions = lambda (anonymous) expressions bound to variables
Example: (+ 1 2) → 3 expressed with lambda is the application of λx.λy.(x + y) to 1 and 2
The definition
(define (foo x y) (sqrt (+ (* x x) (* y y))))
is equivalent to binding a lambda expression to a variable:
(define foo (lambda (x y) (sqrt (+ (* x x) (* y y)))))
Once defined, the function can be applied:
(foo 3 4) → 5

Functional Programming Roots Map and Fold Two important concepts in functional programming – Map: do something to everything in a list – Fold: combine results of a list in some way

Functional Programming Map Higher-order functions – accept other functions as arguments – Map takes a function f and a list as its argument, applies f to all elements in the list, and returns a list as the result Lists are primitive data types – [1 2 3 4 5] – [[a 1] [b 2] [c 3]]

Map/Fold in Action Simple map example:
(map (lambda (x) (* x x)) [1 2 3 4 5]) → [1 4 9 16 25]

Functional Programming Reduce – Fold Fold takes a function g of 2 arguments, an initial value, and a list g is applied to the initial value and the 1st item in the list The result is stored in an intermediate variable The intermediate variable and the next item in the list are the arguments for the 2nd application of g, etc. Fold returns the final value of the intermediate variable

Map/Fold in Action Simple map example:
(map (lambda (x) (* x x)) [1 2 3 4 5]) → [1 4 9 16 25]
Fold examples:
(fold + 0 [1 2 3 4 5]) → 15
(fold * 1 [1 2 3 4 5]) → 120
Sum of squares:
(define (sum-of-squares v)   ; where v is a list
  (fold + 0 (map (lambda (x) (* x x)) v)))
(sum-of-squares [1 2 3 4 5]) → 55

Functional Programming Roots Use map/fold in combination Map – transformation of a dataset Fold – aggregation operation Map can be applied in parallel Fold has more restrictions: elements must be brought together – But many applications do not require g to be applied to all elements of the list at once, so fold aggregations can also run in parallel (see the sketch below)
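The same map/fold pattern can be written outside Lisp. A minimal sketch in Java (assuming Java 9+ streams; this example is an illustration, not part of the original slides):

import java.util.List;

public class SumOfSquares {
  public static void main(String[] args) {
    List<Integer> v = List.of(1, 2, 3, 4, 5);
    int sumOfSquares = v.stream()
        .map(x -> x * x)            // map: transform every element of the list
        .reduce(0, Integer::sum);   // fold: aggregate, starting from the initial value 0
    System.out.println(sumOfSquares);   // prints 55
  }
}

The map step is trivially parallelizable; the fold step can also run in parallel here because addition is associative and commutative.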

MapReduce - Functional Programming Roots Provide input to a function, apply the function, and the function emits output – That output can be used as input to the next stage

MapReduce Map in MapReduce is same as in functional programming Reduce corresponds to fold 2 stages: – User specified computation applied over all input, can occur in parallel, return intermediate output – Output aggregated by another user-specified computation

Mappers/Reducers Key-value pair (k,v) – the basic data structure in MR Keys, values – ints, strings, etc., user defined – e.g. (k – URLs, v – HTML content) – e.g. (k – node ids, v – adjacency lists of nodes) Map: (k1, v1) -> [(k2, v2)] Reduce: (k2, [v2]) -> [(k3, v3)] Where […] denotes a list Notice that the output of Map and the input to Reduce differ: the framework groups the mapper output by key into (k2, [v2])

General Flow
Map:
– Apply the mapper to every input key-value pair stored in the DFS
– Generate an arbitrary number of intermediate (k,v) pairs
– Group-by operation on intermediate keys within the mapper (really a sort? but it's called a shuffle)
Reduce:
– Distribute intermediate results by key across the network to the reducers (really a shuffle? but it's called a sort)
– Aggregate intermediate results
– Generate final output to the DFS – one file per reducer

What function is implemented?

Another Example: unigram counts (word count) Input: (docid, doc) pairs on the DFS, where doc is text The mapper tokenizes doc and emits a (k,v) pair for every word – (word, 1) The execution framework brings all pairs with the same key together at a reducer The reducer sums all the counts (of 1) for each word Each reducer writes one output file – words within a file are sorted, and each file gets roughly the same number of words The output can be used as input to another MR job (a sketch of the mapper and reducer follows)
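A minimal sketch of this word count in Hadoop's Java MapReduce API (class names such as TokenizerMapper and IntSumReducer are illustrative, not from the slides; with the standard text input format the map key is a byte offset standing in for docid):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text doc, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(doc.toString());   // tokenize the document
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, ONE);                                  // emit (word, 1)
    }
  }
}

// In a separate file:
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) sum += c.get();                 // sum all counts for this word
    context.write(word, new IntWritable(sum));                   // emit (word, total)
  }
}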

Hadoop libraries for MapReduce

Combine - Bandwidth Optimization Issue: there can be a very large number of intermediate key-value pairs – Example – word count emits (word, 1) for every occurrence – If copied across the network, the intermediate data is larger than the input

Combine - Bandwidth Optimization Solution: use Combiner functions – allow local aggregation (after the mapper) before the shuffle/sort Word count – aggregate (count each word locally) – intermediate pairs per mapper = # unique words – Executed on the same machine as the mapper – sees no output from other mappers – Results in a “mini-reduce” right after the map phase – Combiner (k,v) types must match the mapper output / reducer input types – If the operation is associative and commutative, the combiner can be the same as the reducer (see below) – Reduces key-value pairs to save bandwidth
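Because summing is associative and commutative, the word count reducer can double as the combiner. In the Hadoop job driver (a fuller driver sketch appears later in this section), this is a single configuration call, assuming the hypothetical IntSumReducer class from the word count sketch above:

job.setCombinerClass(IntSumReducer.class);   // local "mini-reduce" after each map task
// each map task now emits at most one (word, partial count) pair per unique word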

Partitioners – Load Balance Issue: intermediate results could all end up at one reducer Solution: use Partitioner functions – divide up the intermediate key space and assign (k,v) pairs to reducers – Specifies the reduce task to which a (k,v) pair is copied – Each reducer processes its keys in sorted order – The partitioner applies a function to the key – Hopefully each reducer gets about the same number of keys But the key distribution may be Zipfian

MapReduce Programmers specify two functions:
map (k, v) → [(k', v')]
reduce (k', [v']) → [(k'', v'')]
– All v' with the same k' are reduced together
Usually, programmers also specify:
partition (k', number of partitions) → partition for k'
– Often a simple hash of the key, e.g. hash(k') mod n, where n is the number of reducers
– Allows reduce operations for different keys to run in parallel (a partitioner sketch follows)
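A custom Hadoop partitioner implementing hash(k') mod n might look like the following (a sketch; the class name is illustrative, and it mirrors what Hadoop's default hash partitioner already does):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class HashKeyPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // hash(k') mod n, masked so the result is non-negative
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}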

It's not just Map and Reduce
Map: apply the mapper to every input key-value pair stored in the DFS; generate an arbitrary number of intermediate (k,v) pairs
Combine: aggregate locally
Partition: assign intermediate (k,v) pairs to reducers
Shuffle/sort: group-by operation on intermediate keys; distribute intermediate results by key across the network
Reduce: aggregate intermediate results; generate final output to the DFS – one file per reducer

Execution Framework A MapReduce program (job) contains Code for the mappers Combiners (optional) Partitioners (optional) Code for the reducers Configuration parameters (where the input is, where to store the output) – The execution framework takes care of everything else – The developer submits the job to the submission node of the cluster (jobtracker); a driver sketch follows
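A sketch of such a job in Hadoop's Java API, tying together the hypothetical classes from the earlier sketches (TokenizerMapper, IntSumReducer, HashKeyPartitioner); the input and output paths are placeholders passed on the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(TokenizerMapper.class);              // code for mappers
    job.setCombinerClass(IntSumReducer.class);              // combiner (optional)
    job.setPartitionerClass(HashKeyPartitioner.class);      // partitioner (optional)
    job.setReducerClass(IntSumReducer.class);               // code for reducers
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // where the input is
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // where to store the output
    System.exit(job.waitForCompletion(true) ? 0 : 1);       // submit and wait
  }
}

The framework handles everything else: splitting the input, scheduling tasks, shuffling intermediate data, and re-running failed tasks.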

Recall these problems? How do we assign work units to workers? What if we have more work units than workers? What if workers need to share partial results? How do we aggregate partial results? How do we know all the workers have finished? What if workers die? MapReduce takes care of all this

Execution Framework of MapReduce Scheduling – A job is divided into tasks (each responsible for a certain block of (k,v) pairs) – There can be 1000s of tasks that need to be assigned – May exceed the number that can run concurrently – Task queue – Coordination among tasks from different jobs

Execution Framework Synchronization – Concurrently running processes must join up – Intermediate (k,v) pairs are grouped by key and the intermediate data is copied over the network (shuffle/sort) Number of copy operations (M mappers, R reducers)? Worst case: M x R copy operations – each mapper may send intermediate results to every reducer – Reduce computation cannot start until all mappers have finished and the (k,v) pairs are shuffled/sorted This differs from functional programming – But intermediate (k,v) pairs can be copied over the network to a reducer as soon as a mapper finishes

Execution Framework of MapReduce Speculative execution The map phase is only as fast as the slowest map task Problem: stragglers, flaky hardware Solution: use speculative execution – An exact copy of the same task runs on a different machine – The result of the fastest attempt is used – Better for map or reduce? – Can improve running time by 44% (Google) – Doesn't help if there is a skewed distribution of values

Execution Framework Data/code co-location – Execute near the data – If that is not possible, the data must be streamed across the network – Try to keep it within the same rack

Execution Framework Error/fault handling – The norm – Disk failures, RAM errors, datacenter outages – Software errors – Corrupted data

Map Reduce Implementations: – Google has a proprietary implementation in C++ – Hadoop is an open source implementation in Java (led by Yahoo, now Apache) Hadoop is a framework for storage and large-scale processing of data sets on clusters of commodity hardware Walmart uses Hadoop for storage (although they originally broke it – it wouldn't scale as needed) P&G uses Hive – for data analytics, built on top of Hadoop

Differences in MapReduce Implementations Hadoop (Apache) vs. Google – Google – the program can specify a secondary sort; the key cannot be changed in the reducer – Hadoop – values are arbitrarily ordered; the key can be changed in the reducer Hadoop – The programmer can specify the number of map tasks, but the framework makes the final decision – For reduce, the programmer-specified number of tasks is used

Hadoop Be careful using external resources – e.g. querying a SQL DB can become a bottleneck Mappers can emit an arbitrary number of intermediate (k,v) pairs, which can be of a different type than the input Reducers can emit an arbitrary number of final (k,v) pairs, which can be of a different type than the intermediate (k,v) Different from functional programming: can have side effects (internal state changes may cause problems; external side effects may write to files) A MapReduce job can have no reduce, but must have a mapper – Can just pass the identity function as the reducer – May not even need any input, e.g. computing pi (see the note below)
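For a map-only job, Hadoop lets the number of reduce tasks be set to zero, in which case mapper output is written directly to HDFS (a sketch, assuming a Job object like the one in the driver above):

job.setNumReduceTasks(0);   // no reduce phase: each map task's output goes straight to HDFS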

Other Sources Other systems can serve as the source/destination for MapReduce data – Google – BigTable – Hadoop – HBase, a BigTable clone – RDBs integrated with parallel processing: jobs can write to DB tables

File Systems – GFS vs DFS Traditional Distributed File System (DFS) – In HPC, storage is distinct from computation – NAS (network-attached storage) and SAN are common: separate, dedicated nodes for storage – Fetch, load, process, write – the link to storage becomes a bottleneck – Higher-performance networks cost $$ (10G Ethernet); special-purpose interconnects cost $$$ (InfiniBand) – cost increases non-linearly – In GFS, computation and storage are not distinct components

Hadoop Distributed File System - HDFS GFS supports Google's proprietary MapReduce; HDFS supports Hadoop MapReduce does not have to run on GFS/HDFS, but it misses these advantages if it doesn't Differences of GFS and HDFS vs. a traditional DFS: – Adapted to large data processing – user data is divided into chunks/blocks that are LARGE – Blocks are replicated across the local disks of nodes in the cluster – Master-slave architecture

HDFS vs GFS (Google File System) Difference in HDFS: – Master-slave architecture: GFS – master (master), slave (chunkserver); HDFS – master (namenode), slave (datanode) – Master – namespace (metadata, directory structure, file-to-block mapping, location of blocks, access permissions) – Slaves – manage the actual data blocks – The client contacts the namenode for metadata, then gets data from the slaves; 3 copies of each block, etc. – Block size is 64 MB – Initially files were immutable – once closed, they cannot be modified (a client sketch follows)
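For illustration, reading a file through the HDFS client API looks roughly like this: the client asks the namenode for block locations, then streams the bytes from the datanodes. This is a sketch; the class name and path argument are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);               // metadata requests go to the namenode
    try (FSDataInputStream in = fs.open(new Path(args[0]))) {
      IOUtils.copyBytes(in, System.out, 4096, false);   // block data is streamed from datanodes
    }
  }
}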

HDFS Namenode – Namespace management – Coordinates file operations – Lazy garbage collection – Maintains file system health: heartbeats, under-replication, balancing Supports a subset of the POSIX API; the rest is pushed to the application No security

Hadoop Cluster Architecture The HDFS namenode runs the namenode daemon The job submission node runs the jobtracker – the point of contact for running MapReduce – Monitors the progress of MapReduce jobs, coordinates mappers and reducers Slaves run the tasktracker – Runs the user's code and the datanode daemon, serves HDFS data – Sends heartbeat messages to the jobtracker

Hadoop Cluster Architecture The number of reduce tasks depends on the number of reducers specified by the programmer The number of map tasks depends on – A hint from the programmer – The number of input files – The number of HDFS data blocks in those files

Hadoop Cluster Architecture Each map task is assigned a chunk of (k,v) pairs called an input split – Input splits are computed automatically – Aligned on HDFS block boundaries so each split is associated with a single block, which simplifies scheduling – Data locality: run where the data is; if that is not possible, stream it across the network (from the same rack if possible)

How can we use MapReduce to solve problems? Refresh your memory on Dijkstra’s algorithm

Hadoop Cluster Architecture Mappers in Hadoop – Java objects with a MAP method – A mapper object is instantiated for every map task by the tasktracker – Life cycle: instantiation, then a hook in the API for programmer-specified initialization code Mappers can load state, static data sources, dictionaries, etc. – After initialization: the MAP method is called by the framework on every (k,v) pair in the input split – The calls happen within the same Java object, so state can be preserved across multiple (k,v) pairs in the same task – Programmer-specified termination code can also run (see the sketch below)
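A sketch of this life cycle using Hadoop's setup/map/cleanup hooks, here preserving a word-count table across calls within one task ("in-mapper" aggregation; the class name and details are illustrative):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InMapperCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private Map<String, Integer> counts;              // state preserved across map() calls

  @Override
  protected void setup(Context context) {           // initialization hook: load state here
    counts = new HashMap<>();
  }

  @Override
  protected void map(LongWritable offset, Text doc, Context context) {
    for (String token : doc.toString().split("\\s+")) {
      if (!token.isEmpty()) counts.merge(token, 1, Integer::sum);  // aggregate locally
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    for (Map.Entry<String, Integer> e : counts.entrySet()) {       // termination code
      context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
    }
  }
}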

Hadoop Cluster Architecture Reducers in Hadoop – Execution is similar to that of mappers Instantiation, initialization, then the framework calls the REDUCE method with an intermediate key and an iterator over all of that key's values Intermediate keys arrive in sorted order Can preserve state across multiple intermediate keys (see the sketch below)
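For example, a reducer can keep state across keys to track the word with the largest total count and emit it only at the end (a sketch; the class name is illustrative, and it assumes a single reduce task so that one global maximum is produced):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private String maxWord;   // state preserved across reduce() calls, i.e. across keys
  private int maxCount;

  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context) {
    int sum = 0;
    for (IntWritable c : counts) sum += c.get();     // keys arrive in sorted order
    if (sum > maxCount) { maxCount = sum; maxWord = word.toString(); }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    if (maxWord != null) context.write(new Text(maxWord), new IntWritable(maxCount));
  }
}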

CAP Theorem Consistency, availability, partition tolerance – cannot satisfy all 3 at once Partitioning is unavoidable in large data systems, so availability and consistency must be traded off – If the master fails, the system is unavailable but consistent! – If there are multiple masters, it is more available, but inconsistent Workarounds for the single namenode – Warm standby namenode – The Hadoop community is working on it