An In-Memory Framework for Extended MapReduce. 2011 Third IEEE International Conference on Cloud Computing Technology and Science.


BACKGROUND
- The MapReduce programming model simplifies the design and implementation of certain parallel algorithms.
- The original MapReduce model need not support updates or consistency; this allows efficient and simple parallelization over static input data.
- Extended MapReduce models introduce concurrent random-access write operations, for which the original model provides no consistency.
- The original model does not apply to iterative or online computations; extensions of the model have been proposed.
- We propose to use in-memory storage for extended MapReduce.

IN-MEMORY STORAGE
- Avoids the latency of disk accesses and can support consistent data updates.
- Increases the flexibility of the MapReduce parallel programming model without requiring additional communication facilities to propagate data updates.
- Transparently replicates data to computing nodes and can guarantee nontrivial data consistency.
- The EMR framework demonstrates the utility of distributed in-memory storage for extended MapReduce.

OUTLINE
- The MapReduce Model
- Data Access in MapReduce
- Consistency and Fault Tolerance in Extended MapReduce
- The EMR Framework

THE MAPREDUCE MODEL
- Two important properties: the simplicity of parallelizing adequate computations, and applicability to many common data processing problems.
- It is not suitable for problems that are not embarrassingly parallel; extensions to the original model have been proposed.
- Extended MapReduce models: (1) enable online aggregation of data; (2) allow continuous queries for online processing of data streams; (3) support long-running computations, requiring stronger data consistency and fault tolerance.
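The original model can be illustrated with a minimal word-count sketch over static, immutable input (a generic illustration in plain Python, not code from the paper):

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit (word, 1) pairs from the static input data.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle/merge: group values by key, then reduce each group.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(map_phase(["a b a", "b c"]))
```

Because the input is read-only and intermediate pairs are written once, this style of computation needs no update consistency, which is exactly the property the extended models give up.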

EXTENDED MAPREDUCE
- Extended MapReduce models go beyond the original model: they may update data instead of relying on immutable input data and written-once intermediate data, and workers do not necessarily run the map and reduce phases in lock-step.
- Extended MapReduce models can be implemented using data-oriented communication.
- In-memory storage enables data-oriented communication for extended MapReduce: it allows for adaptive caching and makes intermediate results accessible to all nodes without requiring explicit exchange of messages.
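The idea of data-oriented communication can be sketched as follows; a plain dictionary stands in for the distributed in-memory storage, whose real interface (replication, notifications) is not shown in the slides:

```python
class SharedStore:
    """Stand-in for a distributed in-memory store: writers publish
    intermediate results; any reader can see them without an
    explicit message exchange between the nodes."""

    def __init__(self):
        self._objects = {}

    def write(self, key, value):
        # In a real distributed store this write would be replicated
        # to the other computing nodes transparently.
        self._objects[key] = value

    def read(self, key):
        return self._objects.get(key)

store = SharedStore()
store.write(("map", 0), {"a": 1})   # a mapper publishes a partial result
partial = store.read(("map", 0))    # any reducer reads it directly
```

The point of the sketch is the communication pattern: workers coordinate through shared data objects rather than by sending each other messages.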

THE EMR FRAMEWORK: AN IN-MEMORY FRAMEWORK FOR EXTENDED MAPREDUCE
- Includes job management, storage abstractions, and synchronization facilities.
- ECRAM implements a distributed transactional memory (DTM) for strong data consistency.
- Storage descriptors offer data access in a generic manner.
- The different execution models require specialized synchronization.

ARCHITECTURE
Three layers: the extended MapReduce run-time environment, the underlying ECRAM in-memory storage, and a utility library.
- The run-time includes job management, storage abstractions, and synchronization facilities.
- ECRAM stores virtually all data, enabling MapReduce applications to access data in a location-transparent and fault-tolerant way.
- The utility library implements generic data structures.

EXECUTION MODEL
- The master and the workers create storage objects dynamically; map and reduce jobs emerge on demand.
- The master assigns jobs to workers using a job queue.
- ECRAM stores the computational tasks, job descriptors, the job queue, and node information blocks.
- The master calls mapreduce to start a MapReduce run; a worker calls job_run to run jobs.
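The master/worker split above can be sketched like this. The names mapreduce and job_run come from the slides; everything else (the job-descriptor fields, the placeholder work) is assumed for illustration:

```python
from queue import Queue

job_queue = Queue()  # stand-in for the global in-memory job queue

def mapreduce(inputs):
    # Master: create map job descriptors on demand and enqueue them.
    for i, chunk in enumerate(inputs):
        job_queue.put({"id": i, "kind": "map", "data": chunk})

def job_run(results):
    # Worker: fetch and execute jobs until the queue is drained.
    while not job_queue.empty():
        job = job_queue.get()
        results[job["id"]] = job["data"].upper()  # placeholder work

results = {}
mapreduce(["ab", "cd"])
job_run(results)
```

In EMR the queue itself is an object in ECRAM, so any worker on any node can pull from it; here a local queue.Queue stands in for that shared object.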

JOB SCHEDULER
- Implements scheduling based on ECRAM's update notifications.
- Represents all processing nodes and all jobs as in-memory objects.
- Nodes have work queues, and there is a global work queue for jobs.
- The run-time system takes advantage of the wait/notify mechanism and of transaction properties.
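The wait/notify mechanism can be sketched with a condition variable: a worker blocks until the store signals that a new job object has been written. This is a local, single-process analogue of ECRAM's distributed update notifications, not its actual implementation:

```python
import threading
from collections import deque

class NotifyingQueue:
    """Workers block in get() until put() signals that a new
    job object has appeared -- a sketch of update-notification
    based scheduling."""

    def __init__(self):
        self._jobs = deque()
        self._cond = threading.Condition()

    def put(self, job):
        with self._cond:
            self._jobs.append(job)
            self._cond.notify()      # wake one waiting worker

    def get(self):
        with self._cond:
            while not self._jobs:
                self._cond.wait()    # wait for an update notification
            return self._jobs.popleft()

q = NotifyingQueue()
out = []
worker = threading.Thread(target=lambda: out.append(q.get()))
worker.start()
q.put("job-0")
worker.join()
```

The advantage over polling is that idle workers consume no cycles; they are woken exactly when a job object is written to the store.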

EXECUTION OF DIFFERENT MAPREDUCE MODELS
- Conventional MapReduce jobs: an integer object holds the number of finished map jobs; the reduce phase runs after the map phase completes.
- Iterative MapReduce: creates map and reduce jobs alternately; the worker nodes run exactly the same code as with conventional MapReduce.
- Online MapReduce: requires less job synchronization; creates map and reduce jobs as soon as input data is available.
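The iterative variant can be sketched as a driver loop that alternates map and reduce, feeding each reduce output into the next map round (a toy illustration; the function names and the doubling example are assumptions, not the paper's workload):

```python
def run_iterative(state, map_fn, reduce_fn, iterations):
    # Iterative MapReduce: the master creates map and reduce jobs
    # alternately; each iteration consumes the previous reduce output.
    for _ in range(iterations):
        intermediate = [map_fn(x) for x in state]
        state = reduce_fn(intermediate)
    return state

# Toy example: double every value each round, keep the list sorted.
result = run_iterative([1, 2], lambda x: x * 2, sorted, 3)
```

Note that map_fn and reduce_fn are unchanged between rounds, matching the slide's point that workers run exactly the same code as in conventional MapReduce; only the job-creation pattern differs.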

FAULT TOLERANCE
- ECRAM implements fault tolerance using atomic transactions and replicated objects; EMR builds upon ECRAM's fault-tolerance mechanisms, and operations are implemented as transactions.
- When a job is scheduled to run on a node, it is assigned a timestamp. If an old timestamp identifies a lagging job, the node is removed from the worker queue and the job is re-scheduled on another node.
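The timestamp check can be sketched as follows; the job-record fields and the timeout policy are assumptions for illustration, since the slides do not give the concrete detection criterion:

```python
def reschedule_lagging(jobs, now, timeout):
    """If a job's start timestamp is older than `timeout`, treat its
    node as failed: detach the node and mark the job for
    re-scheduling on another node."""
    lagging = []
    for job in jobs:
        if now - job["started"] > timeout:
            job["node"] = None          # remove node from the worker queue
            lagging.append(job["id"])   # job goes back to the global queue
    return lagging

jobs = [
    {"id": 0, "node": "n1", "started": 0.0},  # stale: started long ago
    {"id": 1, "node": "n2", "started": 9.0},  # recent, still healthy
]
stale = reschedule_lagging(jobs, now=10.0, timeout=5.0)
```

Because job state lives in replicated transactional storage, re-running a job on another node is safe: a partially executed transaction from the failed node never becomes visible.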

EVALUATION
- A cluster system of 33 computing nodes, each equipped with 2 AMD Opteron processors and 2 GB of ccNUMA RAM.
- Runs used one master node and varying numbers of workers, with the problem size kept constant.

- The map phase: each node scans its own partition of the input file with no concurrency, so it scales quite well in the number of nodes.
- The reduce phase: all nodes write to the final results concurrently, so a growing number of workers leads to more conflicts.

REAL-TIME RAYTRACING
- Raytracing transforms a 3D scene graph into a 2D image; each ray is traced separately.
- Raytracing can improve an image iteratively, and nodes can update the image of a changing scene, making a case for online MapReduce.
- The master node splits the 2D image plane into equally-sized partitions; the reduce phase simply collates the distinct partitions into a complete image.

- The reduce phase takes almost constant time, while the image-computation time of the map phase decreases almost inversely with the number of nodes.

CONCLUSION
- The original MapReduce model does not apply to certain computational problems.
- Extensions of the original MapReduce model necessitate caring for consistency and reliable execution once again.
- In-memory storage benefits consistency and fault handling, giving rise to the idea of an in-memory framework for extended MapReduce.