Cloud Computing Paradigms for Pleasingly Parallel Biomedical Applications Thilina Gunarathne, Tak-Lon Wu Judy Qiu, Geoffrey Fox School of Informatics,


Authors: Thilina Gunarathne, Tak-Lon Wu, Judy Qiu, Geoffrey Fox Publish: HPDC'10, June 20–25, 2010, Chicago, Illinois, USA ACM Speaker: Jia Bao Lin.


Cloud Computing Paradigms for Pleasingly Parallel Biomedical Applications
Thilina Gunarathne, Tak-Lon Wu, Judy Qiu, Geoffrey Fox
School of Informatics, Pervasive Technology Institute, Indiana University

Introduction
Fourth Paradigm
– Data intensive scientific discovery
– DNA Sequencing machines, LHC
Loosely coupled problems
– BLAST, Monte Carlo simulations, many image processing applications, parametric studies
Cloud platforms
– Amazon Web Services, Azure Platform
MapReduce Frameworks
– Apache Hadoop, Microsoft DryadLINQ

Cloud Computing
On demand computational services over web
– Spiky compute needs of the scientists
Horizontal scaling with no additional cost
– Increased throughput
Cloud infrastructure services
– Storage, messaging, tabular storage
– Cloud oriented service guarantees
– Virtually unlimited scalability

Amazon Web Services
Elastic Compute Cloud (EC2)
– Infrastructure as a service
Cloud Storage (S3)
Queue service (SQS)

Instance Type          Memory    EC2 compute units   Actual CPU cores   Cost per hour
Large                  7.5 GB    4                   2 X (~2 GHz)       0.34 $
Extra Large            15 GB     8                   4 X (~2 GHz)       0.68 $
High CPU Extra Large   7 GB      20                  8 X (~2.5 GHz)     0.68 $
High Memory 4XL        68.4 GB   26                  8 X (~3.25 GHz)    2.40 $
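As a rough illustration of how the instance figures on this slide can be compared, the sketch below divides each hourly price by the listed EC2 compute units. The metric name `usd_per_ecu_hour` and the helper functions are assumptions introduced here for illustration; the slides do not state that instance types were ranked this way, and real application performance depends on more than nominal compute units.

```python
# Prices and EC2 compute units (ECU) copied from the instance table on this slide.
INSTANCES = {
    "Large": {"ecu": 4, "cores": 2, "usd_per_hour": 0.34},
    "Extra Large": {"ecu": 8, "cores": 4, "usd_per_hour": 0.68},
    "High CPU Extra Large": {"ecu": 20, "cores": 8, "usd_per_hour": 0.68},
    "High Memory 4XL": {"ecu": 26, "cores": 8, "usd_per_hour": 2.40},
}


def usd_per_ecu_hour(spec):
    """Nominal price of one EC2 compute unit for one hour (illustrative metric)."""
    return spec["usd_per_hour"] / spec["ecu"]


def cheapest_per_ecu(instances):
    """Name of the instance type with the lowest nominal price per ECU-hour."""
    return min(instances, key=lambda name: usd_per_ecu_hour(instances[name]))


if __name__ == "__main__":
    for name, spec in INSTANCES.items():
        print(f"{name:22s} {usd_per_ecu_hour(spec):.4f} $/ECU-hour")
    print("Lowest nominal price per ECU-hour:", cheapest_per_ecu(INSTANCES))
```

On these list prices the High CPU Extra Large type has the lowest nominal price per compute unit; whether that translates into the lowest cost for a given application still has to be measured.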

Microsoft Azure Platform
Windows Azure Compute
– Platform as a service
Azure Storage Queues
Azure Blob Storage

Instance Type   CPU Cores   Memory    Local Disk Space   Cost per hour
Small           1           1.7 GB    250 GB             0.12 $
Medium          2           3.5 GB    500 GB             0.24 $
Large           4           7 GB      1000 GB            0.48 $
ExtraLarge      8           15 GB     2000 GB            0.96 $

Classic cloud architecture
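The diagram for this slide is not part of the transcript. The sketch below shows one common queue-driven worker arrangement that the queue and storage services named on the earlier slides make possible; it is an assumption about the general pattern, not a reproduction of the slide. An in-memory `queue.Queue` stands in for a cloud queue service, a plain dictionary stands in for cloud storage, and `process` is a hypothetical placeholder for the real executable.

```python
import queue
import threading


def process(data: bytes) -> bytes:
    """Hypothetical stand-in for the real per-file executable."""
    return data.upper()


def worker(tasks, storage, lock):
    """Poll the task queue until it is empty, processing one file per task."""
    while True:
        try:
            key = tasks.get_nowait()
        except queue.Empty:
            return
        with lock:
            data = storage[f"input/{key}"]      # "download" the input file
        result = process(data)
        with lock:
            storage[f"output/{key}"] = result   # "upload" the output file
        tasks.task_done()


def run_job(inputs, n_workers=4):
    """Run every input through `process` using independent workers."""
    storage = {f"input/{key}": value for key, value in inputs.items()}
    tasks = queue.Queue()
    for key in inputs:
        tasks.put(key)                          # one queue message per input file
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(tasks, storage, lock))
        for _ in range(n_workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return {key: storage[f"output/{key}"] for key in inputs}


if __name__ == "__main__":
    print(run_job({"a.fsa": b"acgt", "b.fsa": b"ttga"}))
```

Because each task touches only its own input and output keys, workers never need to communicate with each other, which is what makes this pattern suit loosely coupled workloads.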

MapReduce
General purpose massive data analysis in brittle environments
– Commodity clusters
– Clouds
Apache Hadoop
– HDFS
Microsoft DryadLINQ

MapReduce Architecture
[Figure labels: HDFS, Input Data Set, Data File, Executable, Map(), Optional Reduce Phase, Reduce, Results]
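A minimal sketch of the idea named on this slide, a map step over independent data files followed by an optional reduce phase. It uses a local thread pool rather than Hadoop or DryadLINQ, and the names `run_map_reduce` and `count_reads` are hypothetical helpers introduced only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def run_map_reduce(data_files, map_fn, reduce_fn=None, workers=4):
    """Apply map_fn to every data file; apply reduce_fn only if one is given."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_fn, data_files))  # results keep input order
    if reduce_fn is None:
        return mapped          # map-only job: outputs are collected independently
    return reduce_fn(mapped)   # optional reduce phase


def count_reads(fasta_text):
    """Toy map function: count FASTA records (header lines start with '>')."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))


if __name__ == "__main__":
    files = [">r1\nACGT\n>r2\nTT\n", ">r3\nGG\n"]
    print(run_map_reduce(files, count_reads))        # per-file counts
    print(run_map_reduce(files, count_reads, sum))   # reduced to one total
```

Leaving `reduce_fn` unset models a job whose outputs need no combining step.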

Cap3 – Sequence Assembly
Assembles DNA sequences by aligning and merging sequence fragments to construct whole genome sequences
Increased availability of DNA Sequencers.
Size of a single input file in the range of hundreds of KBs to several MBs.
Outputs can be collected independently, no need of a complex reduce step.

Sequence Assembly Performance with different EC2 Instance Types

Sequence Assembly in the Clouds
[Figure: Cap3 parallel efficiency]
[Figure: Cap3 – Per core per file (458 reads in each file) time to process sequences]
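The slide names two metrics without defining them. The sketch below assumes the conventional definitions: parallel efficiency as T(1) / (p × T(p)), and per-core-per-file time as elapsed time multiplied by the number of cores and divided by the number of files. Both definitions and the function names are assumptions made for illustration, and the numbers in the example are invented, not measurements from the presentation.

```python
def parallel_efficiency(t_sequential, t_parallel, cores):
    """Assumed definition: E = T(1) / (p * T(p))."""
    if t_sequential <= 0 or t_parallel <= 0 or cores <= 0:
        raise ValueError("times and core count must be positive")
    return t_sequential / (cores * t_parallel)


def time_per_core_per_file(elapsed, cores, files):
    """Assumed definition: elapsed time * cores / number of files processed."""
    if elapsed < 0 or cores <= 0 or files <= 0:
        raise ValueError("elapsed must be non-negative; cores and files positive")
    return elapsed * cores / files


if __name__ == "__main__":
    # Invented numbers, for illustration only.
    print(parallel_efficiency(t_sequential=100.0, t_parallel=12.5, cores=8))
    print(time_per_core_per_file(elapsed=3600.0, cores=128, files=4096))
```

Normalising by cores and files in this way lets platforms with different core counts be compared on the same axis.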

Cost to assemble (process) 4096 FASTA files *
Amazon AWS total : 11.19 $
– Compute 1 hour X 16 HCXL (0.68 $ * 16) = 10.88 $
– SQS messages = 0.01 $
– Storage per 1 GB per month = 0.15 $
– Data transfer out per 1 GB = 0.15 $
Azure total : $
– Compute 1 hour X 128 small (0.12 $ * 128) = 15.36 $
– Queue messages = 0.01 $
– Storage per 1 GB per month = 0.15 $
– Data transfer in/out per 1 GB = 0.10 $ $
Tempest (amortized) : 9.43 $
– 24 core X 32 nodes, 48 GB per node
– Assumptions : 70% utilization, write off over 3 years, including support
* ~ 1 GB / reads (458 reads X 4096)
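The arithmetic behind this slide can be checked directly from the listed unit prices. The sketch below recomputes the AWS line items and total, plus the Azure compute charge; the variable and function names are introduced here for illustration. The Azure total is not computed because its value, and possibly one of its line items, is missing from the transcript.

```python
# Unit prices and quantities as listed on the slide (US dollars).
AWS_ITEMS = {
    "compute (16 HCXL x 1 hour)": 16 * 0.68,
    "SQS messages": 0.01,
    "storage (1 GB, 1 month)": 0.15,
    "data transfer out (1 GB)": 0.15,
}


def total_cost(items):
    """Sum the line items and round to whole cents."""
    return round(sum(items.values()), 2)


def azure_compute_cost(instances=128, usd_per_hour=0.12, hours=1):
    """Compute charge only; the slide's Azure total is absent in the transcript."""
    return round(instances * usd_per_hour * hours, 2)


def total_reads(reads_per_file=458, files=4096):
    """Number of reads across all input files."""
    return reads_per_file * files


if __name__ == "__main__":
    print("AWS total:", total_cost(AWS_ITEMS))
    print("Azure compute:", azure_compute_cost())
    print("Total reads:", total_reads())
```

The recomputed AWS figure matches the 11.19 $ total stated on the slide.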

GTM & MDS Interpolation
Finds an optimal user-defined low-dimensional representation out of the data in high-dimensional space
– Used for visualization
Multidimensional Scaling (MDS)
– With respect to pairwise proximity information
Generative Topographic Mapping (GTM)
– Gaussian probability density model in vector space
Interpolation
– Out-of-sample extensions designed to process much larger numbers of data points with a minor trade-off in approximation quality.
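To show why out-of-sample interpolation parallelises so easily, here is a deliberately simplified stand-in. It is not the GTM or MDS interpolation algorithm from the presentation: it places each new point at the inverse-distance-weighted average of the low-dimensional positions of its k nearest already-mapped sample points. The function name, the weighting scheme, and the toy data are all assumptions made for illustration.

```python
import math


def interpolate_point(x, sample_high, sample_low, k=3):
    """Place x in low-dimensional space from its k nearest in-sample points.

    sample_high: high-dimensional coordinates of the already-mapped sample.
    sample_low:  the corresponding low-dimensional coordinates.
    Simplified k-NN, inverse-distance-weighted stand-in (not GTM/MDS itself).
    """
    nearest = sorted(
        (math.dist(x, high), index) for index, high in enumerate(sample_high)
    )[:k]
    if nearest[0][0] == 0.0:
        # x coincides with a sample point: reuse that point's mapping.
        return tuple(sample_low[nearest[0][1]])
    weights = [1.0 / distance for distance, _ in nearest]
    weight_sum = sum(weights)
    dims = len(sample_low[0])
    return tuple(
        sum(w * sample_low[index][d] for w, (_, index) in zip(weights, nearest))
        / weight_sum
        for d in range(dims)
    )


if __name__ == "__main__":
    high = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
    low = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
    print(interpolate_point((1.0, 0.0, 0.0), high, low, k=2))
```

Each call reads only the fixed in-sample mapping and one new point, so the out-of-sample points can be split into independent blocks and processed with no communication between workers.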

GTM Interpolation performance with different EC2 Instance Types
EC2 HM4XL best performance.
EC2 HCXL most economical.
EC2 Large most efficient.

Dimension Reduction in the Clouds - GTM interpolation
[Figure: GTM Interpolation parallel efficiency]
[Figure: GTM Interpolation – Time per core to process 100k data points per core]
26.4 million PubChem data
DryadLINQ using a 16 core machine with 16 GB, Hadoop 8 core with 48 GB, Azure small instances with 1 core with 1.7 GB.

Dimension Reduction in the Clouds - MDS Interpolation
DryadLINQ on 32 nodes X 24 Cores cluster with 48 GB per node.
Azure using small instances.

Acknowledgements
SALSA Group
– Jong Choi
– Seung-Hee Bae
– Jaliya Ekanayake & others
Chemical informatics partners
– David Wild
– Bin Chen
Amazon Web Services for AWS compute credits
Microsoft Research for technical support on Azure & DryadLINQ

Thank You!! Questions?