Optimizing Cloud MapReduce for Processing Stream Data using Pipelining. Authors: Rutvik Karve, Devendra Dahiphale, Amit Chhajer. Presenter: 饒展榕.

Source: 2011 UKSim 5th European Symposium on Computer Modeling and Simulation.

OUTLINE
1. INTRODUCTION
2. LITERATURE SURVEY
3. OUR PROPOSED ARCHITECTURE
4. ADVANTAGES, FEATURES AND APPLICATIONS
5. CONCLUSIONS AND FUTURE WORK

INTRODUCTION
Cloud MapReduce (CMR) is a framework for processing large batch data sets in the cloud. Its Map and Reduce phases run sequentially, one after the other. This leads to:
1. Compulsory batch processing
2. No parallelization of the Map and Reduce phases
3. Increased delays

We propose a novel architecture to support streaming data as input using pipelining between the Map and Reduce phases in CMR, ensuring that the output of the Map phase is made available to the Reduce phase as soon as it is produced.

This ‘Pipelined MapReduce’ approach increases parallelism between the Map and Reduce phases, thereby:
1. Supporting streaming data as input
2. Reducing delays
3. Enabling the user to take ‘snapshots’ of the approximate output generated in a stipulated time frame
4. Supporting cascaded MapReduce jobs

LITERATURE SURVEY
A. MapReduce
The model consists of two phases: a Map phase and a Reduce phase. Initially, the data is divided into smaller 'splits'. These splits are independent of each other and can hence be processed in parallel. Each split consists of a set of (key, value) pairs that form records.
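The two phases can be illustrated with a minimal word-count sketch. This is not code from the paper: `map_fn`, `reduce_fn`, `run_job`, and the sample splits are hypothetical names chosen for illustration, and everything runs in one process.

```python
from collections import defaultdict

def map_fn(_key, line):
    """Emit (word, 1) for every word in one input record."""
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(key, values):
    """Sum all counts that share a key."""
    return key, sum(values)

def run_job(splits):
    # Map phase: each split is independent, so this loop could run in parallel.
    intermediate = defaultdict(list)
    for split in splits:
        for key, value in split:
            for out_key, out_value in map_fn(key, value):
                intermediate[out_key].append(out_value)
    # Reduce phase: in this batch sketch it starts only after all splits are mapped.
    return dict(reduce_fn(k, v) for k, v in intermediate.items())

# Each split is a list of (key, value) records; here the key is a line number.
splits = [
    [(0, "map reduce map")],
    [(1, "reduce stream")],
]
counts = run_job(splits)
```

Note that the Reduce phase here waits for the whole Map phase, which is the batch behaviour the pipelined design aims to relax.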

B. Hadoop
Hadoop is an open-source implementation of the MapReduce programming model, developed by the Apache Software Foundation. The Hadoop framework is used for batch processing of large data sets on a physical cluster of machines.

It incorporates a distributed file system called the Hadoop Distributed File System (HDFS), a common set of commands, a scheduler, and the MapReduce evaluation framework. Hadoop is popular for processing huge data sets, especially in social networking, targeted advertising, and internet log processing.

C. Cloud MapReduce
Cloud MapReduce [2] is a lightweight implementation of the MapReduce programming model on top of the Amazon cloud OS, using Amazon EC2 instances.

The architecture of CMR, as described in [2], consists of one input queue; multiple reduce queues, which act as staging areas for the intermediate (key, value) pairs produced by the Mappers; a master reduce queue that holds pointers to the reduce queues; and an output queue that holds the final results.
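The queue layout described above can be pictured with a small in-memory sketch. It is only an illustration: Python deques stand in for the cloud queues, and the class name `CMRQueues` and the `reduce-N` naming scheme are assumptions, not part of CMR.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class CMRQueues:
    """In-memory stand-in for the queues named in the CMR architecture."""
    num_reduce_queues: int
    input_queue: deque = field(default_factory=deque)
    reduce_queues: dict = field(default_factory=dict)
    master_reduce_queue: deque = field(default_factory=deque)
    output_queue: deque = field(default_factory=deque)

    def __post_init__(self):
        for i in range(self.num_reduce_queues):
            name = f"reduce-{i}"  # hypothetical naming scheme
            self.reduce_queues[name] = deque()
            # The master reduce queue holds a pointer (here, the name) to each reduce queue.
            self.master_reduce_queue.append(name)

queues = CMRQueues(num_reduce_queues=3)
queues.input_queue.append(("split-0", "some input records"))
```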

D. Online MapReduce (Hadoop Online Prototype)
The Hadoop Online Prototype (HOP) is a modification of the traditional Hadoop framework that incorporates pipelining between the Map and Reduce phases, thereby supporting parallelism between these phases and providing support for processing streaming data.

A downstream dataflow element can begin processing before an upstream producer finishes. HOP also carries out online aggregation of data to produce incrementally correct output.
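A rough way to picture pipelining with online aggregation is a chain of Python generators, where the consumer folds each pair into a running result as soon as the producer emits it. This is a sketch under stated assumptions, not HOP's implementation: `mapper`, `pipelined_reducer`, and the `snapshot_every` parameter are hypothetical.

```python
from collections import Counter

def mapper(records):
    """Yield each intermediate (key, value) pair as soon as it is produced."""
    for record in records:
        for word in record.split():
            yield word, 1

def pipelined_reducer(pairs, snapshot_every):
    """Fold pairs into a running aggregate; yield a copy every few pairs."""
    totals = Counter()
    for seen, (key, value) in enumerate(pairs, start=1):
        totals[key] += value
        if seen % snapshot_every == 0:
            yield dict(totals)  # approximate output, refined incrementally
    yield dict(totals)          # final output once the input ends

stream = iter(["a b", "a c", "a"])
snapshots = list(pipelined_reducer(mapper(stream), snapshot_every=2))
```

Each intermediate dictionary is an approximate result that later snapshots refine; the last one is the complete aggregate.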

OUR PROPOSED ARCHITECTURE
Our proposal aims to bridge the gap between the heavyweight HOP and the lightweight, scalable Cloud MapReduce implementation by adding support for processing stream data to Cloud MapReduce.

The challenges involved in the implementation include:
1. Providing support for streaming data at input
2. A novel design for output aggregation
3. Handling Reducer failures
4. Handling windows based on timestamps

1) The first design option: As in CMR, the Mapper pushes each intermediate (key, value) pair to one of the reduce queues, chosen by the hash value of the intermediate key. The hash function can be user-defined or a provided default, as in CMR.
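The routing step of this option might look like the sketch below. The CRC32-based default hash, the function names, and the in-memory deques are assumptions made for illustration; the source does not specify CMR's actual default hash function.

```python
import zlib
from collections import deque

def default_partition(key, num_queues):
    """Assumed default hash: stable CRC32 of the key, modulo the queue count."""
    return zlib.crc32(str(key).encode("utf-8")) % num_queues

def push_intermediate(pair, reduce_queues, partition=default_partition):
    """Mapper side: route one (key, value) pair to the queue picked by the hash."""
    key, _ = pair
    index = partition(key, len(reduce_queues))
    reduce_queues[index].append(pair)
    return index

reduce_queues = [deque() for _ in range(4)]
for pair in [("apple", 1), ("pear", 1), ("apple", 1)]:
    push_intermediate(pair, reduce_queues)

def by_letter(key, num_queues):
    """Example of a user-defined hash: route by the key's first character."""
    return ord(key[0]) % num_queues
```

Because the hash depends only on the key, every pair with the same key lands in the same reduce queue; passing `partition=by_letter` swaps in the user-defined function.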

2) The second design option: Alternatively, we can have a single queue between the Mappers and the Reducers, the IntermediateQueue, to which all the intermediate (key, value) pairs generated by all the Mappers are pushed.

Reducers poll this IntermediateQueue for records. Aggregation is carried out as follows: there is a fixed number of output splits. Whenever a Reducer reads a (key, value) record from the IntermediateQueue, it applies a user-defined Reduce function to the record to produce an output (key, value) pair.

It then selects an output split by hashing on the output key produced.
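The second option can be sketched as follows. The number of splits, the CRC32 hash, the sample `reduce_fn`, and the rule that a split merges values per key by summing them are all assumptions for illustration; the source does not spell out how values are merged inside a split.

```python
import zlib
from collections import deque

NUM_OUTPUT_SPLITS = 3  # fixed number of output splits (value chosen for illustration)

def select_split(key, num_splits=NUM_OUTPUT_SPLITS):
    """Pick an output split by hashing the output key."""
    return zlib.crc32(str(key).encode("utf-8")) % num_splits

def reduce_fn(key, value):
    """Hypothetical user-defined Reduce function applied to a single record."""
    return key.lower(), value

def reducer_poll(intermediate_queue, output_splits):
    """Drain the shared queue; any Reducer may process any record."""
    while intermediate_queue:
        key, value = intermediate_queue.popleft()
        out_key, out_value = reduce_fn(key, value)
        split = output_splits[select_split(out_key)]
        # Assumption: the split merges values per key by summing them.
        split[out_key] = split.get(out_key, 0) + out_value

intermediate_queue = deque([("Cat", 1), ("dog", 1), ("cat", 1)])
output_splits = [{} for _ in range(NUM_OUTPUT_SPLITS)]
reducer_poll(intermediate_queue, output_splits)
merged = {k: v for split in output_splits for k, v in split.items()}
```

Since the split is chosen from the output key alone, records for one key always meet in the same split no matter which Reducer handled them.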

3) Hybrid approach: Have multiple ReduceQueues, each linked statically to a particular Reducer. Instead of also linking the output splits to the Reducers statically, hash the output key of each (key, value) pair generated by a Reducer to select an output split.

This will require fewer changes to the existing CMR architecture, but will involve static linking of Reducers to ReduceQueues.
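A minimal sketch of the hybrid approach, assuming an identity Reduce function, a CRC32 hash, and a summing merge rule inside each output split (none of which are specified in the source); the `Reducer` class and `stable_hash` helper are hypothetical.

```python
import zlib
from collections import deque

def stable_hash(key, buckets):
    """Deterministic hash used for both queue and split selection in this sketch."""
    return zlib.crc32(str(key).encode("utf-8")) % buckets

class Reducer:
    """A Reducer bound to exactly one ReduceQueue for its whole lifetime."""
    def __init__(self, reduce_queue, output_splits):
        self.reduce_queue = reduce_queue    # static link to one queue
        self.output_splits = output_splits  # shared; a split is chosen per record

    def drain(self):
        while self.reduce_queue:
            key, value = self.reduce_queue.popleft()
            split = self.output_splits[stable_hash(key, len(self.output_splits))]
            split[key] = split.get(key, 0) + value  # assumed merge rule: sum

reduce_queues = [deque() for _ in range(2)]
output_splits = [{} for _ in range(3)]
reducers = [Reducer(q, output_splits) for q in reduce_queues]

# Mapper side: route each pair to a ReduceQueue by hashing the intermediate key.
for key, value in [("a", 1), ("b", 2), ("a", 3)]:
    reduce_queues[stable_hash(key, len(reduce_queues))].append((key, value))
for reducer in reducers:
    reducer.drain()
merged = {k: v for split in output_splits for k, v in split.items()}
```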

ADVANTAGES, FEATURES AND APPLICATIONS
A. Advantages
1. Either design allows Reducers to start processing as soon as data is made available by the Mappers. This allows parallelism between the Map and Reduce phases.
2. A downstream processing element can start processing as soon as some data is available from an upstream element.
3. The network is better utilized, as data is continuously pushed from one phase to the next.
4. The final output is computed incrementally.
5. Introducing a pipeline between the Reduce phase of one job and the Map phase of the next job will support cascaded MapReduce jobs.
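The cascading advantage can be pictured by chaining two streaming jobs so that the second job's Map consumes the first job's Reduce output record by record. This is a sketch under assumptions: `job`, `source`, and the sample map and reduce functions are hypothetical, and each Reduce call handles a single record.

```python
def job(records, map_fn, reduce_fn):
    """One streaming job: each reduced record is yielded as soon as it is ready."""
    for record in records:
        for key, value in map_fn(record):
            yield reduce_fn(key, value)

def source(lines, read_log):
    """Input stream that logs each line at the moment it is read."""
    for line in lines:
        read_log.append(line)
        yield line

# Job 1: tokenize lines and normalize each word.
def tokenize(line):
    return [(word, 1) for word in line.split()]

def normalize(key, value):
    return key.lower(), value

# Job 2: re-key each word by its length.
def by_length(pair):
    word, count = pair
    return [(len(word), count)]

def passthrough(key, value):
    return key, value

read_log = []
stage_one = job(source(["Map Reduce", "map"], read_log), tokenize, normalize)
cascade = job(stage_one, by_length, passthrough)

first = next(cascade)                       # first output of the second job
lines_read_before_first_output = len(read_log)
rest = list(cascade)
```

The second job produces its first record after only one input line has been read, which is the effect the pipeline between jobs is meant to achieve.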

B. Features
1) Time windows
2) Snapshots
3) Cascaded MapReduce jobs
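The slides name time windows without detailing them. One possible reading, sketched here purely as an assumption, is a fixed (non-overlapping) window keyed by each record's timestamp; `window_counts` and the event format are hypothetical.

```python
from collections import defaultdict

def window_counts(events, window_seconds):
    """Group (timestamp, key) events into fixed windows and count keys per window."""
    windows = defaultdict(lambda: defaultdict(int))
    for timestamp, key in events:
        # Window start: the timestamp rounded down to a multiple of the window size.
        start = (timestamp // window_seconds) * window_seconds
        windows[start][key] += 1
    return {start: dict(counts) for start, counts in sorted(windows.items())}

events = [(0, "click"), (4, "view"), (9, "click"), (10, "click"), (19, "view")]
result = window_counts(events, window_seconds=10)
```

Here the events with timestamps 0 to 9 fall into the window starting at 0, and those with timestamps 10 to 19 into the window starting at 10.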

C. Applications
With these features, the design is particularly suited to stream processing of data. Typical stream processing applications include the analysis and processing of web feeds, click-streams, micro-blogging, and stock market quotes.

CONCLUSIONS AND FUTURE WORK
The proposed design is inherently scalable because it is cloud-based. This also gives it a 'lightweight' nature, as the handling of distributed resources is done by the Cloud OS. Future work will include supporting rolling windows for obtaining outputs over arbitrary time intervals of the input stream.