Authors: Thilina Gunarathne, Tak-Lon Wu, Judy Qiu, Geoffrey Fox Publish: HPDC'10, June 20–25, 2010, Chicago, Illinois, USA. 2010 ACM Speaker: Jia Bao Lin.

Slides:

Advertisements

Similar presentations

SLA-Oriented Resource Provisioning for Cloud Computing

Advertisements

EHarmony in Cloud Subtitle Brian Ko. eHarmony Online subscription-based matchmaking service Available in United States, Canada, Australia and United Kingdom.

Natasha Pavlovikj, Kevin Begcy, Sairam Behera, Malachy Campbell, Harkamal Walia, Jitender S.Deogun University of Nebraska-Lincoln Evaluating Distributed.

SALSA HPC Group School of Informatics and Computing Indiana University.

Low Cost, Scalable Proteomics Data Analysis Using Amazon's Cloud Computing Services and Open Source Search Algorithms Brian D. Halligan, Ph.D. Medical.

Twister4Azure Iterative MapReduce for Windows Azure Cloud Thilina Gunarathne Indiana University Iterative MapReduce for Azure Cloud.

SCALABLE PARALLEL COMPUTING ON CLOUDS : EFFICIENT AND SCALABLE ARCHITECTURES TO PERFORM PLEASINGLY PARALLEL, MAPREDUCE AND ITERATIVE DATA INTENSIVE COMPUTATIONS.

Clouds from FutureGrid’s Perspective April Geoffrey Fox Director, Digital Science Center, Pervasive.

SALSASALSASALSASALSA Chemistry in the Digital Age Workshop, Penn State University, June 11, 2009 Geoffrey Fox

Parallel Data Analysis from Multicore to Cloudy Grids Indiana University Geoffrey Fox, Xiaohong Qiu, Scott Beason, Seung-Hee.

MapReduce in the Clouds for Science CloudCom 2010 Nov 30 – Dec 3, 2010 Thilina Gunarathne, Tak-Lon Wu, Judy Qiu, Geoffrey Fox {tgunarat, taklwu,

DAvinCi: A Cloud Computing Framework for Service Robots

WORKFLOWS IN CLOUD COMPUTING. CLOUD COMPUTING  Delivering applications or services in on-demand environment  Hundreds of thousands of users / applications.

Dimension Reduction and Visualization of Large High-Dimensional Data via Interpolation Seung-Hee Bae, Jong Youl Choi, Judy Qiu, and Geoffrey Fox School.

SALSASALSASALSASALSA Digital Science Center June 25, 2010, IIT Geoffrey Fox Judy Qiu School.

Iterative computation is a kernel function to many data mining and data analysis algorithms. Missing in current MapReduce frameworks is collective communication,

Panel Session The Challenges at the Interface of Life Sciences and Cyberinfrastructure and how should we tackle them? Chris Johnson, Geoffrey Fox, Shantenu.

3DAPAS/ECMLS panel Dynamic Distributed Data Intensive Analysis Environments for Life Sciences: June San Jose Geoffrey Fox, Shantenu Jha, Dan Katz,

Data Mining on the Web via Cloud Computing COMS E6125 Web Enhanced Information Management Presented By Hemanth Murthy.

A Brief Overview by Aditya Dutt March 18 th ’ Aditya Inc.

Applying Twister to Scientific Applications CloudCom 2010 Indianapolis, Indiana, USA Nov 30 – Dec 3, 2010.

Research on cloud computing application in the peer-to-peer based video-on-demand systems Speaker : 吳靖緯 MA0G rd International Workshop.

Cloud MapReduce ： a MapReduce Implementation on top of a Cloud Operating System Speaker : 童耀民 MA1G Authors: Huan Liu, Dan Orban Accenture.

SOFTWARE SYSTEMS DEVELOPMENT MAP-REDUCE, Hadoop, HBase.

Science in Clouds SALSA Team salsaweb/salsa Community Grids Laboratory, Digital Science Center Pervasive Technology Institute Indiana University.

SALSASALSA Twister: A Runtime for Iterative MapReduce Jaliya Ekanayake Community Grids Laboratory, Digital Science Center Pervasive Technology Institute.

CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.

Introduction to Apache Hadoop Zibo Wang. Introduction  What is Apache Hadoop?  Apache Hadoop is a software framework which provides open source libraries.

Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.

Introduction to Hadoop and HDFS

Optimizing Cloud MapReduce for Processing Stream Data using Pipelining 作者 :Rutvik Karve ， Devendra Dahiphale ， Amit Chhajer 報告 : 饒展榕.

Portable Parallel Programming on Cloud and HPC: Scientific Applications of Twister4Azure Thilina Gunarathne Bingjing Zhang, Tak-Lon.

1 A Framework for Data-Intensive Computing with Cloud Bursting Tekin Bicer David ChiuGagan Agrawal Department of Compute Science and Engineering The Ohio.

An Architecture for Distributed High Performance Video Processing in the Cloud Speaker : 吳靖緯 MA0G IEEE 3rd International Conference.

MARISSA: MApReduce Implementation for Streaming Science Applications 作者 : Fadika, Z. ; Hartog, J. ; Govindaraju, M. ; Ramakrishnan, L. ; Gunter, D. ; Canon,

Eneryg Efficiency for MapReduce Workloads: An Indepth Study Boliang Feng Renmin University of China Dec 19.

SALSASALSASALSASALSA Design Pattern for Scientific Applications in DryadLINQ CTP DataCloud-SC11 Hui Li Yang Ruan, Yuduo Zhou Judy Qiu, Geoffrey Fox.

Parallel Applications And Tools For Cloud Computing Environments Azure MapReduce Large-scale PageRank with Twister Twister BLAST Thilina Gunarathne, Stephen.

SALSA HPC Group School of Informatics and Computing Indiana University.

Introducing MapReduce to High End Computing Grant Mackey, Julio Lopez, Saba Sehrish, John Bent, Salman Habib, Jun Wang University of Central Florida, Carnegie.

Large scale IP filtering using Apache Pig and case study Kaushik Chandrasekaran Nabeel Akheel.

Optimizing Cloud MapReduce for Processing Stream Data using Pipelining 2011 UKSim 5th European Symposium on Computer Modeling and Simulation Speker : Hong-Ji.

SALSASALSASALSASALSA Clouds Ball Aerospace March Geoffrey Fox

U N I V E R S I T Y O F S O U T H F L O R I D A Hadoop Alternative The Hadoop Alternative Larry Moore 1, Zach Fadika 2, Dr. Madhusudhan Govindaraju 2 1.

Department of Computer Science MapReduce for the Cell B. E. Architecture Marc de Kruijf University of Wisconsin−Madison Advised by Professor Sankaralingam.

CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.

Security: systems, clouds, models, and privacy challenges iDASH Symposium San Diego CA October Geoffrey.

Cloud Computing Paradigms for Pleasingly Parallel Biomedical Applications Thilina Gunarathne, Tak-Lon Wu Judy Qiu, Geoffrey Fox School of Informatics,

SALSA Group Research Activities April 27, Research Overview  MapReduce Runtime  Twister  Azure MapReduce  Dryad and Parallel Applications 

SALSASALSASALSASALSA Digital Science Center February 12, 2010, Bloomington Geoffrey Fox Judy Qiu

Parallel Applications And Tools For Cloud Computing Environments CloudCom 2010 Indianapolis, Indiana, USA Nov 30 – Dec 3, 2010.

Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies

HPC in the Cloud – Clearing the Mist or Lost in the Fog Panel at SC11 Seattle November Geoffrey Fox

Memcached Integration with Twister Saliya Ekanayake - Jerome Mitchell - Yiming Sun -

SALSASALSASALSASALSA Data Intensive Biomedical Computing Systems Statewide IT Conference October 1, 2009, Indianapolis Judy Qiu

3/12/2013Computer Engg, IIT(BHU)1 CLOUD COMPUTING-2.

PARALLEL AND DISTRIBUTED PROGRAMMING MODELS U. Jhashuva 1 Asst. Prof Dept. of CSE om.

Distributed Process Discovery From Large Event Logs Sergio Hernández de Mesa {

COMP7330/7336 Advanced Parallel and Distributed Computing MapReduce - Introduction Dr. Xiao Qin Auburn University

Item Based Recommender System SUPERVISED BY: DR. MANISH KUMAR BAJPAI TARUN BHATIA ( ) VAIBHAV JAISWAL( )

Hadoop Javad Azimi May What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data. It includes:

Sushant Ahuja, Cassio Cristovao, Sameep Mohta

Geoffrey Fox, Shantenu Jha, Dan Katz, Judy Qiu, Jon Weissman

Our Objectives Explore the applicability of Microsoft technologies to real world scientific domains with a focus on data intensive applications Expect.

Applying Twister to Scientific Applications

Biology MDS and Clustering Results

Twister4Azure : Iterative MapReduce for Azure Cloud

Clouds from FutureGrid’s Perspective

Cloud versus Cloud: How Will Cloud Computing Shape Our World?

Presentation transcript:

Authors: Thilina Gunarathne, Tak-Lon Wu, Judy Qiu, Geoffrey Fox Publish: HPDC'10, June 20–25, 2010, Chicago, Illinois, USA ACM Speaker: Jia Bao Lin 1

 Introduction  Cloud Technologies and Application Architecture  Applications  Performance  Related Works  Conclusion & Future Work 2

 Scientists are overwhelmed with the increasing amount of data processing needs arising from the storm of data that is flowing through virtually every field of science.  Preprocessing, processing and analyzing these large amounts of data is a unique very challenging problem, yet opens up many opportunities for computational as well as computer scientists.  Cloud computing offerings by major commercial players provide on demand computational services over the web, which can be purchased within a matter of minutes simply by use of a credit card. 3

 Scientists has the ability to increase the throughput of their computations by horizontally scaling the compute resources without incurring additional cost overhead.  In this paper we introduce a set of abstract frameworks constructed using the cloud oriented programming models to perform pleasingly parallel computations.  We use Amazon Web Services and Microsoft Windows Azure cloud computing platforms and use Apache Hadoop Map Reduce and Microsoft DryaLINQ as the distributed parallel computing frameworks. 4

 Amazon Web Services (AWS) are a set of cloud computing services by Amazon, offering on demand compute and storage services including but not limited to Elastic Compute Cloud (EC2), Simple Storage Service (S3) and Simple Queue Service (SQS). 5

6  Microsoft Azure platform is a cloud computing platform offering a set of cloud computing services similar to the Amazon Web Services. Windows Azure compute, Azure Storage Queues and Azure Storage blob services are the Azure counterparts for Amazon EC2, Amazon SQS and the Amazon S3 services.

7

8

9

10  Dryad is a framework developed by Microsoft Research as a general-purpose distributed execution engine for coarse-grain parallel applications.  Similar to the Map Reduce frameworks, the Dryad scheduler optimizes the data transfer overheads by scheduling the computations near data and handles failures through rerunning of tasks and duplicate instance execution.

11  Cap3 is a sequence assembly program which assembles DNA sequences by aligning and merging sequence fragments to construct whole genome sequences.  Cap3 is relatively not memory intensive compared to the interpolation algorithms.  Size of a typical data input file for Cap3 program and the result data file range from hundreds of kilobytes to few megabytes.

12  Multidimensional Scaling (MDS) and Generative Topographic Mapping (GTM) are dimension reduction algorithms which finds an optimal user-defined low-dimensional representation out of the data in high-dimensional space.  The GTM interpolation application is high memory intensive and requires large amount of memory proportional to the size of the input data.  The MDS interpolation is less memory bound compared to GTM interpolation, but is much more CPU intensive than the GTM interpolation.

13 Cap3GTM

14  T(1) is the best sequential execution time for the application in a particular environment using the same data set or a representative subset if the sequential time is prohibitively large to measure.  T(ρ) is the parallel run time for the application while “p” is the number of processor cores used.

15  Per core per computation time is calculated in each test to give an idea about the actual execution times in the different environments.

16

17

18 Hadoop Cap3 parallel efficiency using shared file system vs. HDFS on a 768 core (24 core X 32 nodes) cluster Hadoop Cap3 performance shared FS vs. HDFS

19 GTM interpolation efficiency on 26 Million PubChem data points GTM interpolation performance on 26.4 Million PubChem data set

20 DryadLINQ MDS interpolation performance on a 768 core cluster (32 node X 24 cores) Azure MDS interpolation performance on 24 small Azure instances

21  There exist many studies of using existing traditional scientific applications and benchmarks on the cloud.  In one of our earlier works we analyzed the overhead of virtualization and the effect of inhomogeneous data on the cloud oriented programming frameworks.  There are other bio-medical applications developed using Map Reduce programming frameworks such as CloudBLAST, which implements BLAST algorithm and CloudBurst, which performs parallel genome read mappings.

 Cloud infrastructure based models as well as the Map Reduce based frameworks offered good parallel efficiencies given sufficiently coarser grain task decompositions.  The higher level MapReduce paradigm offered a simpler programming model. Also by using two different kinds of applications we showed that selecting an instance type which suits your application can give significant time and monetary advantages.  The cost effectiveness of cloud data centers combined with the comparable performance reported here suggests that loosely coupled science applications will increasingly be implemented on clouds and that using MapReduce frameworks will offer convenient user interfaces with little overhead. 22