Parallel Applications And Tools For Cloud Computing Environments


Azure MapReduce, Large-scale PageRank with Twister, Twister BLAST

Thilina Gunarathne, Stephen Tak-lon Wu, Hui Li, Yuduo Zhou, Bingjing Zhang, Adam Lee Hughes, Saliya Ekanayake, Jong Youl Choi, Seung-Hee Bae, Yang Ruan
SALSA group, Pervasive Technology Institute, Indiana University, Bloomington, Indiana
Advisors: Professor Geoffrey Fox and Professor Judy Qiu

SALSA PROJECTS

Twister BLAST

• A simple parallel BLAST application based on the Twister MapReduce framework
• Runs on a single machine, a cluster, or the Amazon EC2 cloud platform
• Adaptable to the latest BLAST tool (BLAST )
• Uses state-of-the-art binary-invoking parallelism to fully utilize the highly optimized stand-alone BLAST software since it is
• Brings scalability and simplicity to program and database maintenance
• The query is partitioned and transmitted to all nodes
• The database is replicated to all nodes before execution
• The database is compressed before replication and transported through the Twister File Tool

Large-scale PageRank with Twister

• Efficient processing of large-scale PageRank challenges current MapReduce runtimes
• Difficulties: messaging > memory > computation
• Implementations: Twister, DryadLINQ, Hadoop, MPI
• Optimization strategies
  - Load partition data in memory
  - Fit partition size in memory
  - Local merge in the Reduce stage
• Results visualization with PlotViz3
• 10K 3D vertices processed with MDS
• Implemented with DryadLINQ on 50 million web pages on a 32-node Windows HPC cluster
• Level of granularity
  - Coarse granularity: split the whole web graph into 256 files
  - Fine granularity: split the whole web graph into 1280 files
• Implemented with Twister and Hadoop on 50 million web pages
• Twister caches the partitions of the web graph in memory across iterations, while Hadoop needs to reload each partition from disk into memory for every iteration

Azure MapReduce

A decentralized MapReduce framework built on top of Windows Azure cloud services.

• A solution to the void of parallel programming frameworks on Microsoft Azure
• Uses distributed, highly scalable and available cloud services
• Supports dynamically scaling up and down
• No single point of failure
• Comparable performance
• Fault tolerance
• Combiner step
• Web-based monitoring console
• Easy testing and deployment
• Co-exists with the eventual consistency of cloud infrastructure services
• Minimal management and maintenance overhead

[Figures: Smith Waterman Sequence Alignment All-Pairs, Normalized Performance; CAP3 Sequence Assembly, Absolute Parallel Efficiency]
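The PageRank notes above contrast caching web-graph partitions in memory across iterations with reloading them from disk on every iteration, and list a local merge as an optimization. The following is a minimal single-process Python sketch of that idea, not Twister's or Hadoop's API. The function names, the toy graph, the damping factor of 0.85, and the uniform redistribution of rank from pages without out-links are all assumptions made for illustration, and the local merge is modeled here as summing contributions within each partition before the global reduction.

```python
from collections import defaultdict


def partition_graph(adjacency, num_partitions):
    """Split the adjacency list into partitions (hypothetical round-robin split)."""
    partitions = [dict() for _ in range(num_partitions)]
    for index, (page, links) in enumerate(sorted(adjacency.items())):
        partitions[index % num_partitions][page] = links
    return partitions


def map_partition(partition, ranks):
    """Map task: compute rank contributions, merged locally per target page."""
    local = defaultdict(float)
    for page, links in partition.items():
        if links:
            share = ranks[page] / len(links)
            for target in links:
                local[target] += share  # local merge: one value per target
    return local


def pagerank(adjacency, num_partitions=2, iterations=20, damping=0.85):
    """Iterative PageRank; every link target must also be a key of adjacency."""
    pages = sorted(adjacency)
    n = len(pages)
    # Static data: partitioned once and kept in memory for all iterations.
    partitions = partition_graph(adjacency, num_partitions)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        merged = defaultdict(float)
        for partition in partitions:  # cached partitions are reused, not reloaded
            for target, value in map_partition(partition, ranks).items():
                merged[target] += value  # reduce: sum the partial contributions
        # Assumption: rank held by pages without out-links is spread uniformly.
        dangling = sum(ranks[p] for p in pages if not adjacency[p])
        ranks = {
            p: (1.0 - damping) / n + damping * (merged[p] + dangling / n)
            for p in pages
        }
    return ranks


if __name__ == "__main__":
    toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": []}
    for page, rank in sorted(pagerank(toy_graph).items()):
        print(page, round(rank, 4))
```

Only the small rank table changes between iterations; the partitions are built once outside the loop, which is the part a runtime that reloads input per iteration would repeat each time.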

SALSA Portal and Biosequence Analysis Workflow

[Figure: SALSA Portal job flow. Labels: Create Biosequence Analysis Job; Job Configuration and Submission Tool; Submit; Retrieve Results; Microsoft HPC Cluster (Cluster Head-node, Compute Nodes); Distribute Job; Write Results; Sequence Aligning; Pairwise Clustering; Dimension Scaling; PlotViz - 3D Visualization Tool]

[Figure: biosequence analysis workflow. Labels: Alu Sequences; Pairwise Alignment & Distance Calculation; Distance Matrix; Pairwise Clustering; Multi-Dimensional Scaling; Visualization; Cluster Indices; Coordinates; 3D Plot]

The goal of a biosequence workflow is to automate the process by which scientists analyze large groups of sequences. In this case, sequences are clustered in some meaningful way, and the results are transformed into three-dimensional space for visualization.

The SALSA Portal presents a set of web services for interacting with HPC resources. One of the Portal's high-level use cases is an encapsulation of the complete biosequence workflow discussed here. The SALSA Portal use cases are implemented through a set of tiered WCF services.

The SALSA biosequence workflow consists of a configuration builder and .NET versions of sequence alignment, pairwise clustering, and dimensional scaling software. The results are visualized using PlotViz.

PlotViz Visualization with parallel MDS/GTM

• A tool for visualizing data points
  - Dimension reduction by GTM and MDS
  - Browse large and high-dimensional data
  - Use many open (value-added) data
• Parallel visualization algorithms
  - GTM (Generative Topographic Mapping)
  - MDS (Multi-dimensional Scaling)
  - Interpolation extensions to GTM and MDS

[Figure: System Architecture of PlotViz]

Solvent-screening study

This visualizes the result of GTM dimension reduction for 215 solvents used in a pharmaceutical pre-screening process, along with 100,000 chemical compounds. The result shows that the tool can clearly separate solvents from other chemicals based on their structural characteristics, and that users can navigate the large chemical space through visualization.

[Figure: Screenshot of PlotViz]

CTD data visualization

About 930,000 gene- and disease-related chemical compounds from the PubChem database, visualized using both the MDS (left) and GTM (right) algorithms and labeled with different colors to discover cause-and-effect associations between genes and diseases, based on the Comparative Toxicogenomics Database (CTD) dataset.
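The workflow described above goes from sequences to a pairwise distance matrix and then to three-dimensional coordinates through multidimensional scaling. A rough, single-process Python sketch of those two stages follows, under stated assumptions: plain edit distance stands in for the alignment-based distance, and the embedding uses the standard SMACOF (Guttman transform) update with unit weights, which is a swapped-in textbook method and not necessarily the algorithm the SALSA software uses. All function names and the toy sequences are hypothetical.

```python
import math
import random


def edit_distance(a, b):
    """Levenshtein distance; a stand-in for an alignment-based distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def distance_matrix(seqs):
    """Symmetric all-pairs matrix; only the upper triangle is computed."""
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = float(edit_distance(seqs[i], seqs[j]))
    return d


def stress(d, x):
    """Raw MDS stress: squared gap between target and embedded distances."""
    n = len(d)
    return sum((d[i][j] - math.dist(x[i], x[j])) ** 2
               for i in range(n) for j in range(i + 1, n))


def smacof_step(d, x):
    """One Guttman transform with unit weights: X_new = (1/n) * B(X) * X."""
    n, dim = len(x), len(x[0])
    new_x = [[0.0] * dim for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dist = math.dist(x[i], x[j])
            if dist == 0.0:
                continue  # coincident points contribute nothing
            ratio = d[i][j] / dist
            for k in range(dim):
                new_x[i][k] += ratio * (x[i][k] - x[j][k])
        for k in range(dim):
            new_x[i][k] /= n
    return new_x


def mds_3d(d, iterations=50, seed=0):
    """Embed the items in 3D, starting from a seeded random layout."""
    rng = random.Random(seed)
    x = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(len(d))]
    for _ in range(iterations):
        x = smacof_step(d, x)
    return x


if __name__ == "__main__":
    sequences = ["ACGT", "ACGA", "TTTT", "TTTA"]
    matrix = distance_matrix(sequences)
    coords = mds_3d(matrix)
    print("stress:", stress(matrix, coords))
```

The resulting coordinates are what a 3D viewer would plot, with cluster indices from a separate clustering step used to color the points. The all-pairs loop is the part that parallelizes naturally, since each block of the upper triangle can be computed independently.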