MapReduce in the Clouds for Science CloudCom 2010 Nov 30 – Dec 3, 2010 Thilina Gunarathne, Tak-Lon Wu, Judy Qiu, Geoffrey Fox {tgunarat, taklwu,

Presentation transcript:


Introduction
• Cloud computing combined with cloud infrastructure services
  • A very viable alternative for scientists
• MapReduce
  • Excellent fault tolerance features
  • Scalability
  • Ease of use
• Several options for using MapReduce in cloud environments
  • MapReduce as a service
  • Setting up a MapReduce cluster on cloud instances
  • Specialized cloud MapReduce runtimes that take advantage of cloud infrastructure services

Introduction
• Analyze the performance and viability of performing two types of bioinformatics computations using MapReduce in cloud environments
  • Sequence alignment
  • Sequence assembly
• AzureMapReduce
  • Leverages the high-latency, eventually consistent, yet highly scalable Azure infrastructure services to provide a decentralized, on-demand MapReduce framework
• Sustained performance of clouds

Platforms
• Apache Hadoop
  • On bare metal
  • On EC2
• Amazon Web Services
  • Elastic MapReduce
• Microsoft Azure
  • AzureMapReduce

Challenges for MapReduce in the clouds
• Data storage
• Reliability
• Master node
• Metadata storage
• Performance consistency
• Communication consistency and scalability
• Sustained performance
• Choosing suitable instance types
• Logging

AzureMapReduce
• Addresses the lack of parallel programming frameworks on Microsoft Azure
• Built using Azure cloud services
  • Distributed, highly scalable and highly available services, backed by industrial-strength data centers and technologies
• Minimal management and maintenance overhead
• Reduced footprint
• Decentralized control
• Ability to dynamically scale up/down

AzureMapReduce Features
• Familiar programming model
• Fault tolerance
• Coexists with the eventual consistency of cloud services
• Easy testing and deployment
• Combiner step
• Web-based monitoring console
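To make the "familiar programming model" and "combiner step" bullets concrete, here is a minimal in-memory sketch of a map / combine / reduce word count. It is only an illustration of the general MapReduce model: the function names (`map_fn`, `combine_fn`, `reduce_fn`, `run_job`) are hypothetical and are not AzureMapReduce's API.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Emit (word, 1) for every word in one input record."""
    for word in line.split():
        yield word, 1


def combine_fn(key, values):
    """Pre-aggregate one map task's local output before it is shipped."""
    yield key, sum(values)


def reduce_fn(key, values):
    """Produce the final count for one key."""
    yield key, sum(values)


def group_by_key(pairs):
    """Group (key, value) pairs into key -> list of values."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def run_job(splits, map_fn, combine_fn, reduce_fn):
    """Run the job sequentially; each split stands in for one map task."""
    intermediate = []
    for split in splits:
        mapped = [kv for i, record in enumerate(split) for kv in map_fn(i, record)]
        # The combiner runs on each map task's own output only.
        for key, values in group_by_key(mapped).items():
            intermediate.extend(combine_fn(key, values))
    result = {}
    for key, values in group_by_key(intermediate).items():
        for out_key, out_value in reduce_fn(key, values):
            result[out_key] = out_value
    return result
```

With a combiner, each map task ships one pair per distinct word instead of one pair per occurrence, which reduces the intermediate data that has to be transferred.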

AzureMapReduce Architecture

• The Sort & Reduce phases of a reduce task start
  • when all the map tasks are finished, and
  • when the reduce task has finished downloading all the intermediate data products
• There is no guarantee about when all the intermediate data will appear in the Task tables
• Each map task stores the number of data products it generated for each reduce task
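The start condition on this slide can be sketched as a small check. This is a hedged illustration, not the actual AzureMapReduce code: the record layout (`status`, `products_per_reducer`) and the function names are assumptions made for the example. The idea is that a reduce task sums the per-map counts to learn how many intermediate products to expect, so it does not depend on all of them being visible in an eventually consistent table at any given moment.

```python
def expected_products(map_task_records, reduce_id):
    """Total number of intermediate products the map tasks report for one reducer."""
    return sum(
        record["products_per_reducer"].get(reduce_id, 0)
        for record in map_task_records
    )


def can_start_reduce(map_task_records, reduce_id, downloaded_count):
    """Return True when the Sort & Reduce phases may begin for `reduce_id`.

    Condition 1: every map task is finished.
    Condition 2: the reducer has downloaded as many products as the
    map tasks say they generated for it.
    """
    all_maps_finished = all(
        record["status"] == "finished" for record in map_task_records
    )
    if not all_maps_finished:
        return False
    return downloaded_count >= expected_products(map_task_records, reduce_id)
```

A reduce task would poll a check like this while downloading, and move on to sorting only once it returns True.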

Performance
• Parallel efficiency
• AzureMapReduce
  • Azure small instances – single core
• Hadoop bare metal
  • IBM iDataPlex cluster
• EMR & Hadoop on EC2
  • Cap3 – HighCPU Extra Large (8 cores, 20 CU)
  • SWG – Extra Large (4 cores, 8 CU)
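The slide names parallel efficiency as the metric but does not spell out a formula. A common definition, assumed here, is E(p) = T(1) / (p × T(p)), where T(1) is the run time on one core and T(p) the run time on p cores. The helper below is a sketch under that assumption.

```python
def parallel_efficiency(t1, tp, p):
    """Parallel efficiency E(p) = T(1) / (p * T(p)).

    t1: run time on a single core
    tp: run time on p cores (same units as t1)
    p:  number of cores used for the parallel run
    """
    if p <= 0 or tp <= 0:
        raise ValueError("p and tp must be positive")
    return t1 / (p * tp)
```

A value of 1.0 means ideal linear speedup; overheads such as scheduling, data transfer, and load imbalance pull the value below 1.0.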

Sequence Alignment
• Smith-Waterman-Gotoh (SWG) to calculate all-pairs dissimilarity
[Figure labels: OutFile1, OutFile2, OutFile3, OutFile4]
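As a rough illustration of the computation named on this slide, the sketch below scores every pair of sequences with a Smith-Waterman local alignment using Gotoh-style affine gap penalties. The scoring parameters are arbitrary assumptions, and the slides do not say how alignment scores are converted into dissimilarities, so this sketch stops at the raw score matrix.

```python
NEG_INF = float("-inf")


def sw_gotoh_score(a, b, match=2, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Best local alignment score of a and b with affine gap penalties.

    gap_open is charged for the first character of a gap and
    gap_extend for each further character (both negative).
    """
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]        # best score ending at (i, j)
    E = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # alignment ends with a gap along the row
    F = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # alignment ends with a gap along the column
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(E[i][j - 1] + gap_extend, H[i][j - 1] + gap_open)
            F[i][j] = max(F[i - 1][j] + gap_extend, H[i - 1][j] + gap_open)
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best


def all_pairs_scores(seqs):
    """Symmetric matrix of pairwise scores; only the upper triangle is computed."""
    n = len(seqs)
    scores = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            scores[i][j] = scores[j][i] = sw_gotoh_score(seqs[i], seqs[j])
    return scores
```

Because each pair is scored independently, the pairs can be split into groups and handed to separate map tasks.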

Sequence Alignment Performance

Sequence Assembly
• Assemble sequences using Cap3
• Pleasingly parallel
• Map only
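A map-only, pleasingly parallel job can be sketched as one independent task per input file with no reduce phase. The example below is an illustration only: it assumes a `cap3` executable on the PATH that takes the input FASTA file as its first argument, and the helper names and the thread pool standing in for cloud workers are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def cap3_command(fasta_path):
    """Command line for one map task (assumes `cap3` is installed and on PATH)."""
    return ["cap3", fasta_path]


def run_map_task(fasta_path, build_command=cap3_command):
    """Run the assembler on one input file and report its exit code."""
    completed = subprocess.run(
        build_command(fasta_path), capture_output=True, text=True
    )
    return fasta_path, completed.returncode


def run_map_only_job(fasta_paths, workers=4, build_command=cap3_command):
    """Process every file independently; there is no shuffle or reduce step."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda path: run_map_task(path, build_command), fasta_paths)
        return dict(results)
```

Since the tasks share no data, a failed file can simply be rerun on its own, and adding workers needs no coordination beyond handing out file names.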

Sequence Assembly Performance

Sustained performance of clouds

Thanks