Brief Overview of Big Data, Hadoop, MapReduce
Jianer Chen, CSCE-629, Fall 2015

A Lot of Data
- Google processes 20 PB a day (2008)
- The Wayback Machine held 3 PB, growing by terabytes per month (03/2009); 9.6 PB more recently
- Facebook processes 500 TB/day (08/2012)
- eBay has > 10 PB of user data, plus 50 TB/day (01/2012)
- The CERN Data Centre has over 100 PB of physics data

Units: KB (kilobyte) = 10^3 bytes; MB (megabyte) = 10^6 bytes; GB (gigabyte) = 10^9 bytes; TB (terabyte) = 10^12 bytes; PB (petabyte) = 10^15 bytes

Google Example
- 20+ billion web pages x 20 KB each = 400+ TB
- One computer reads only tens of MB/sec from disk, so it would take more than 4 months just to read the web pages
- About 1,000 hard drives are needed just to store the web pages
- Not scalable: it takes even longer to do something useful with the data!
- A standard architecture for such problems has emerged:
  - a cluster of commodity Linux nodes
  - a commodity network (Ethernet) to connect them

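A quick back-of-the-envelope check of the read-time claim above (assuming a sustained read rate of roughly 35 MB/sec, a figure that is not given explicitly on the slide):

  # Sanity check of the "more than 4 months" claim (read rate is an assumption).
  total_bytes = 20e9 * 20e3              # 20+ billion pages x 20 KB each = 4e14 bytes (400 TB)
  read_rate = 35e6                       # assumed sequential disk read rate: 35 MB/sec
  seconds = total_bytes / read_rate
  print(seconds / (3600 * 24 * 30))      # about 4.4 months of continuous reading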
Cluster Architecture: Many Machines
- Nodes within a rack are connected by a switch at 1 Gbps
- A 2-10 Gbps backbone connects the racks
- Each rack holds many commodity nodes
- Google had about 1 million machines in 2011

Hadoop Cluster
- Node roles: DN = data node, TT = task tracker, NN = name node
- Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware.
- It provides massive storage for any kind of data, enormous processing power, and the ability to handle virtually limitless concurrent tasks or jobs.

Cluster Computing: A Classical Algorithmic Idea: Divide-and-Conquer
- Partition the work into pieces: work 1, work 2, work 3, work 4
- Each "worker" solves its piece, producing result 1, result 2, result 3, result 4
- Combine the partial results into the final result

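As a toy single-machine illustration of this partition/solve/combine pattern (a hedged sketch; the summing worker and the pool size are illustrative assumptions, not from the slides):

  from concurrent.futures import ProcessPoolExecutor

  def solve(chunk):
      # "worker": solve one piece of the work
      return sum(chunk)

  def divide_and_conquer(data, n_workers=4):
      # partition the work into pieces of roughly equal size
      size = (len(data) + n_workers - 1) // n_workers
      chunks = [data[i:i + size] for i in range(0, len(data), size)]
      # each worker solves its piece in parallel
      with ProcessPoolExecutor(max_workers=n_workers) as pool:
          partial_results = list(pool.map(solve, chunks))
      # combine the partial results
      return sum(partial_results)

  if __name__ == "__main__":
      print(divide_and_conquer(list(range(1, 101))))   # 5050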
Challenges in Cluster Computing
- How do we assign work units to workers?
- What if we have more work units than workers?
- What if workers need to share partial results?
- How do we aggregate partial results?
- How do we know all the workers have finished?
- What if workers die?

What is the common theme of all of these problems? Parallelization problems arise from:
- communication between workers (e.g., to exchange state)
- access to shared resources (e.g., data)
We need a synchronization mechanism.

Therefore, we need:
- the right level of abstraction: a new model more appropriate for the multicore/cluster environment
- to hide system-level details from developers: no more race conditions, lock contention, etc.
- to separate the what from the how: the developer specifies the computation that needs to be performed, and the execution framework handles the actual execution
This motivated MapReduce.

MapReduce: Big Ideas
- Failures are common in cluster systems: the MapReduce implementation copes with failures (automatic task restart)
- Data movement is expensive in supercomputers: MapReduce moves processing to the data (leveraging locality)
- Disk I/O is time-consuming: MapReduce organizes computation into long streaming operations
- Developing distributed software is difficult: MapReduce isolates developers from implementation details

Typical Large-Data Problem
- Iterate over a large number of records
- Extract something of interest from each (map)
- Shuffle and sort intermediate results
- Aggregate intermediate results (reduce)
- Generate final output

Key idea of MapReduce: provide a functional abstraction for these two operations. [Dean and Ghemawat, OSDI 2004]

MapReduce: General Framework
- The input is divided into InputSplits; each split is processed by a map task
- Map outputs are shuffled and sorted by key and fed to reduce tasks
- Reduce output is written to the distributed file system (DFS)
- The map and reduce functions are user specified; input splitting, shuffle and sort, and output handling are system provided

MapReduce
Programmers specify two functions:
  map (k1, v1) → (k2, v2)*
  reduce (k2, v2*) → (k3, v3)*
All values with the same key are sent to the same reducer. The execution framework handles everything else.

Example: Word Count
  map(docID, text) → (word, 1)*
  reduce(word, [1, ..., 1]) → (word, sum)*

  Map(String docID, String text):
    for each word w in text:
      Emit(w, 1)

  Reduce(String word, Iterator values):
    int sum = 0;
    for each v in values:
      sum += v;
    Emit(word, sum);

MapReduce: Word Count
- Each mapper reads (docID, text) pairs from its InputSplit and emits (word, 1) pairs, e.g., (a,1) (b,1) (a,1) (c,1) (b,1) (c,1) (c,1) (a,1) (a,1) (a,1) (c,1) (b,1)
- Shuffle and sort aggregates values by key: a → [1,1,1,1,1], b → [1,1,1], c → [1,1,1,1]
- Each reducer sums the values for its keys and emits (a,5), (b,3), (c,4)
- The output is written to the DFS
(The Map and Reduce functions are those given in the word-count example above.)

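To make the end-to-end flow concrete, here is a minimal single-machine sketch in Python that mimics the map, shuffle-and-sort, and reduce phases for word count (the function names and the run_mapreduce driver are illustrative, not part of Hadoop's API):

  from collections import defaultdict

  def map_fn(doc_id, text):
      # map(docID, text) -> (word, 1)*
      for word in text.split():
          yield (word, 1)

  def reduce_fn(word, values):
      # reduce(word, [1, ..., 1]) -> (word, sum)
      return (word, sum(values))

  def run_mapreduce(docs):
      # "Shuffle and sort": group all intermediate values by key.
      groups = defaultdict(list)
      for doc_id, text in docs.items():
          for key, value in map_fn(doc_id, text):
              groups[key].append(value)
      # Reduce phase: one call per distinct key.
      return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

  if __name__ == "__main__":
      docs = {"d1": "a b a c", "d2": "b c c a", "d3": "a a c b"}
      print(run_mapreduce(docs))    # {'a': 5, 'b': 3, 'c': 4}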
MapReduce: Framework
- Handles scheduling: assigns workers to map and reduce tasks
- Handles data distribution: moves processes to the data
- Handles synchronization: gathers, sorts, and shuffles intermediate data
- Handles errors and faults: detects worker failures and restarts tasks
Everything happens on top of a distributed file system.

MapReduce: User Specification
Programmers specify two functions:
  map (k1, v1) → (k2, v2)*
  reduce (k2, v2*) → (k3, v3)*
- All values with the same key are sent to the same reducer
- Mappers and reducers can specify any computation; be careful with access to external resources!
- The execution framework handles everything else

Not quite... often, programmers also specify:
  partition (k2, number of partitions) → partition for k2
- often a simple hash of the key, e.g., hash(k2) mod n
- divides up the key space for parallel reduce operations
  combine (k2, v2) → (k2', v2')
- mini-reducers that run in memory after the map phase
- used as an optimization to reduce network traffic

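A minimal sketch of the default partitioning behavior described above (using Python's built-in hash as a stand-in for a real hash function; not Hadoop's actual implementation):

  def default_partition(key, num_partitions):
      # partition(k2, number of partitions) -> partition for k2
      # A simple hash of the key, hash(k2) mod n, divides the key space
      # among the parallel reduce tasks.
      return hash(key) % num_partitions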
MapReduce: Word Count with Combining and Partitioning
- Each mapper reads (docID, text) pairs from its InputSplit and aggregates counts locally (e.g., emitting (a,2) (b,1) (c,1) instead of four separate pairs)
- A partition function assigns each intermediate key to a reducer; shuffle and sort then aggregates values by key
- Reducers emit the final counts: (a,5), (b,3), (c,4)

Map(String docID, String text):
  for each word w in text:
    H[w] = H[w] + 1;
  for each word w in H:
    Emit(w, H[w])

Reduce is the same as before:
  int sum = 0;
  for each v in values:
    sum += v;
  Emit(word, sum);

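The in-mapper combining shown in the Map function above can be sketched in Python as follows (a minimal single-machine sketch; the function name is illustrative, and the reducer is unchanged from the earlier word-count sketch):

  from collections import defaultdict

  def map_with_combining(doc_id, text):
      # Aggregate counts in a local table H before emitting, so far fewer
      # (word, count) pairs have to cross the network to the reducers.
      H = defaultdict(int)
      for word in text.split():
          H[word] += 1
      for word, count in H.items():
          yield (word, count)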
Example: Shortest-Path

Data structure: the adjacency list (with edge weights) for the graph
- Each vertex v has a node ID
- Let Av be the set of neighbors of v
- Let dv be the current distance from the source to v

Basic ideas:
- The original input is (s, [0, As])
- On an input (v, [dv, Av]), the Mapper emits pairs whose key (i.e., vertex) is in Av, each with a distance derived from dv
- On an input (v, [dv, Av]*), the Reducer emits the pair (v, [dv, Av]) with the minimum distance dv

Map(v, [dv, Av]):
  Emit(v, [dv, Av]);
  for each w in Av do
    Emit(w, [dv + wt(v, w), Aw]);

Reduce(v, [dv, Av]*):
  dmin = +∞;
  for each [dv, Av] in [dv, Av]*:
    if dmin > dv then dmin = dv;
  Emit(v, [dmin, Av])

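A minimal single-machine sketch of the iterative scheme above, assuming a simple driver loop with a fixed number of rounds in place of Hadoop's job scheduler (the function names and the example graph are illustrative):

  from collections import defaultdict

  INF = float("inf")

  def map_fn(v, dv, adj):
      # Pass the vertex's own record through (so the graph structure and the
      # current distance survive), and propose a tentative distance to every
      # neighbor, as in the Map function above.
      yield (v, (dv, adj[v]))
      if dv < INF:
          for w, wt in adj[v]:
              yield (w, (dv + wt, adj[w]))

  def reduce_fn(v, records):
      # Keep the record with the minimum tentative distance for v.
      return min(records, key=lambda rec: rec[0])

  def shortest_paths(adj, source, num_rounds):
      dist = {v: (0 if v == source else INF) for v in adj}
      for _ in range(num_rounds):             # driver loop: one MapReduce job per hop
          groups = defaultdict(list)
          for v in adj:                       # map phase
              for key, rec in map_fn(v, dist[v], adj):
                  groups[key].append(rec)
          for v, recs in groups.items():      # reduce phase
              dist[v] = reduce_fn(v, recs)[0]
      return dist

  if __name__ == "__main__":
      # Adjacency list with edge weights: v -> [(neighbor, weight), ...]
      adj = {"s": [("a", 1), ("b", 4)], "a": [("b", 2)], "b": []}
      print(shortest_paths(adj, "s", num_rounds=2))   # {'s': 0, 'a': 1, 'b': 3}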
Example: Shortest-Path (continued)
MapReduce iterations:
- The first iteration discovers all neighbors of the source s
- The second iteration discovers all "2nd-level" neighbors of s
- Each iteration expands the "search frontier" by one hop

- The approach is suitable for graphs with small diameter (e.g., "small-world" graphs)
- A "driver" program is needed to check whether the algorithm has terminated (in practice: Hadoop counters)
- The algorithm can be extended to report the actual paths

Summary: MapReduce Graph Algorithms
- Store graphs as adjacency lists
- Each Map task receives a vertex and its outlinks
- A Map task computes some function of the link structure and emits a value with the target vertex as the key
- A Reduce task collects these keys (target vertices) and aggregates
- Iterate over multiple MapReduce cycles until some termination condition is met; the graph structure is passed from one iteration to the next
- The same idea can be used to solve other graph problems

CSCE-629 Course Summary

Basic notations, concepts, and techniques:
- Pseudo-code for algorithms
- Big-Oh notation
- Divide-and-conquer
- Dynamic programming
- Solving recurrence relations

Data manipulation (data structures, algorithms, complexity):
- Heap
- 2-3 trees
- Hashing
- Union-Find
- Finding the median

Graph algorithms and applications:
- DFS and BFS, and simple applications
- Connected components
- Topological sorting
- Strongly connected components
- Longest path in a DAG

Computational optimization:
- Maximum bandwidth paths
- Dijkstra's algorithm (shortest path)
- Kruskal's algorithm (MST)
- Bellman-Ford algorithm (shortest path)
- Matching in bipartite graphs
- Sequence alignment

NP-completeness theory:
- P and polynomial-time computation
- Definition of NP, membership in NP
- Polynomial-time reducibility
- NP-hardness and NP-completeness
- Proving NP-hardness and NP-completeness
- NP-complete problems: SAT, IS, VC, Partition