SALSA Group Research Activities, April 27, 2011

Research Overview
- MapReduce Runtime
  - Twister
  - Azure MapReduce
- Dryad and Parallel Applications
- NIH Projects
  - Bioinformatics
  - Workflow
  - Data Visualization (GTM/MDS/PlotViz)
- Education

Twister & Azure MapReduce

What is Twister?
Twister is an iterative MapReduce framework that supports:
- Customized static input data partitions
- Cacheable map/reduce tasks
- A combining operation that brings intermediate outputs together in the main program
- Fault recovery between iterations
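The iterative pattern listed above can be sketched in a few lines. This is a minimal single-process illustration in Python, not Twister's API: the function names (`kmeans_map`, `kmeans_reduce`, `combine`, `run_kmeans`) and the one-dimensional K-means example are assumptions chosen for brevity. The static partitions stay in memory across iterations (standing in for cached map tasks), and only the small centroid list changes from one iteration to the next.

```python
from collections import defaultdict


def kmeans_map(partition, centroids):
    """Map: assign each cached point to its nearest centroid, emit partial sums."""
    partial = defaultdict(lambda: [0.0, 0])  # centroid index -> [sum, count]
    for x in partition:
        k = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        partial[k][0] += x
        partial[k][1] += 1
    return partial


def kmeans_reduce(partials):
    """Reduce: merge the partial sums produced by all map tasks."""
    totals = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for k, (s, n) in partial.items():
            totals[k][0] += s
            totals[k][1] += n
    return totals


def combine(totals, centroids):
    """Combine: hand new centroids back to the main program (empty clusters keep theirs)."""
    return [totals[k][0] / totals[k][1] if k in totals else c
            for k, c in enumerate(centroids)]


def run_kmeans(partitions, centroids, max_iter=20, tol=1e-9):
    """Main program: iterate map -> reduce -> combine until the centroids stop moving."""
    for _ in range(max_iter):
        partials = [kmeans_map(p, centroids) for p in partitions]  # static data reused
        new_centroids = combine(kmeans_reduce(partials), centroids)
        if max(abs(a - b) for a, b in zip(new_centroids, centroids)) < tol:
            return new_centroids
        centroids = new_centroids
    return centroids


print(run_kmeans([[1.0, 2.0], [10.0, 11.0]], [0.0, 5.0]))  # [1.5, 10.5]
```

In a real runtime the map calls would run on separate workers; the sketch only shows which data is static and which is passed around on every iteration.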

Twister Programming Model

Twister Architecture

Applications and Performance

MapReduceRoles for Azure
- MapReduce framework for the Azure cloud
- Built using highly available and scalable Azure cloud services
  - Distributed, highly scalable, and highly available services
  - Minimal management and maintenance overhead
  - Reduced footprint
- Hides the complexity of the cloud and cloud services from users
- Co-exists with the eventual consistency and high latency of cloud services
- Decentralized control: avoids a single point of failure

MapReduceRoles for Azure
- Supports dynamic scaling up and down of compute resources
- Fault tolerance
- Combiner step
- Web-based monitoring console
- Easy testing and deployment

Twister for Azure
- Iterative MapReduce framework for the Microsoft Azure cloud
- Merge step
- In-memory caching of static data
- Cache-aware hybrid scheduling using queues as well as a bulletin board
[Figure: Kmeans performance with/without data caching]

Performance Comparisons
[Figure panels:]
- BLAST sequence search
- Cap3 sequence assembly
- Smith-Waterman sequence alignment
- Kmeans scaling speedup
- Kmeans with an increasing number of iterations

Dryad & Parallel Applications

DryadLINQ CTP Evaluation
- The beta version was released in December 2010
- Motivation:
  - Evaluate key features and interfaces in DryadLINQ
  - Study the parallel programming model in DryadLINQ
- Three applications:
  - SW-G bioinformatics application
  - Matrix-matrix multiplication
  - PageRank

Parallel programming model
- DryadLINQ stores input data as DistributedQuery objects
- It splits distributed objects into partitions with the following APIs:
  - AsDistributed()
  - RangePartition()

Common LINQ providers
Provider            Base class
LINQ-to-objects     IEnumerable
PLINQ               ParallelQuery
LINQ-to-SQL         IQueryable
LINQ-to-?           IQueryable
DryadLINQ           DistributedQuery

Matrix-Matrix Multiplication
- Parallel programming algorithms
  - Row split
  - Row-column split
  - 2-dimensional block decomposition in the Fox algorithm
- Multi-core technologies in .NET
  - TPL, PLINQ, thread pool
- Hybrid parallel model
  - Port multi-core code into Dryad tasks to improve performance
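The decompositions named above can be illustrated with a small sketch. It is written in plain Python rather than DryadLINQ, and the function names are hypothetical. `row_split_multiply` gives each task a band of rows of A together with all of B. `block_multiply` accumulates the same block products that a 2-dimensional decomposition such as Fox's algorithm works through, but it runs them sequentially and does not model the broadcast and shift communication steps.

```python
def matmul(a, b):
    """Reference product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def row_split_multiply(a, b, parts):
    """Row split: each task multiplies one band of rows of A by the whole of B."""
    step = -(-len(a) // parts)  # ceiling division
    bands = [a[i:i + step] for i in range(0, len(a), step)]
    result = []
    for band in bands:  # every band is an independent task
        result.extend(matmul(band, b))
    return result


def block_multiply(a, b, q):
    """2D block decomposition on a q x q grid: C[I][J] = sum over K of A[I][K] * B[K][J]."""
    n = len(a)
    if n % q != 0 or any(len(row) != n for row in a + b) or len(b) != n:
        raise ValueError("this sketch assumes square n x n matrices with n divisible by q")
    s = n // q
    c = [[0] * n for _ in range(n)]
    for bi in range(q):
        for bj in range(q):
            for bk in range(q):  # one block product per step
                for i in range(bi * s, (bi + 1) * s):
                    for j in range(bj * s, (bj + 1) * s):
                        c[i][j] += sum(a[i][k] * b[k][j]
                                       for k in range(bk * s, (bk + 1) * s))
    return c


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[1, 0, 2, 0], [0, 1, 0, 2], [3, 0, 1, 0], [0, 3, 0, 1]]
print(row_split_multiply(A, B, 2) == block_multiply(A, B, 2))  # True
```

Row split is the simplest to express but replicates B to every task; the block form keeps each task's working set to a few sub-blocks, at the cost of more coordination.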

PageRank
- Grouped aggregation
  - A core primitive of many distributed programming models
  - Two stages: 1) partition the data into groups by some keys; 2) perform an aggregation over each group
- DryadLINQ provides two types of grouped aggregation
  - GroupBy(), without the partial aggregation optimization
  - GroupAndAggregate(), with partial aggregation
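The difference between the two grouped-aggregation styles can be shown with a small Python sketch. It does not use the DryadLINQ operators; both function names are hypothetical, and the "shuffle" is modelled simply as the number of records that would cross partition boundaries. With partial aggregation, each partition first sums its own values per key, so at most one record per key per partition is shuffled.

```python
from collections import defaultdict


def group_then_aggregate(partitions):
    """No partial aggregation: shuffle every record, then group and sum."""
    shuffled = [record for part in partitions for record in part]
    groups = defaultdict(list)
    for key, value in shuffled:
        groups[key].append(value)
    totals = {key: sum(values) for key, values in groups.items()}
    return totals, len(shuffled)


def partial_then_final_aggregate(partitions):
    """Partial aggregation: sum per key inside each partition before the shuffle."""
    shuffled = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        shuffled.extend(local.items())  # one record per key per partition
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


parts = [[("a", 1), ("a", 2), ("b", 1)], [("a", 3), ("b", 4), ("b", 5)]]
print(group_then_aggregate(parts))          # ({'a': 6, 'b': 10}, 6)
print(partial_then_final_aggregate(parts))  # ({'a': 6, 'b': 10}, 4)
```

The saving grows with the number of repeated keys per partition, which is why the optimization matters for PageRank-style workloads where many contributions target the same page. It applies only when the aggregation can be split into a local and a final step, as a sum can.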

NIH Projects

Sequence Clustering
[Pipeline diagram]
Stages: Gene Sequences -> Pairwise Alignment & Distance Calculation -> Distance Matrix -> Pairwise Clustering (output: Cluster Indices) and Multi-Dimensional Scaling (output: Coordinates) -> Visualization (3D Plot)
Annotations on the diagram:
- Smith-Waterman / Needleman-Wunsch with Kimura2 / Jukes-Cantor / Percent-Identity
- MPI.NET implementation
- Chi-Square / Deterministic Annealing
- C# desktop application based on VTK
* Note: The implementations of the Smith-Waterman and Needleman-Wunsch algorithms are from the Microsoft Biology Foundation library.

Scale-up Sequence Clustering with Twister
[Pipeline diagram]
- Input: Gene Sequences (N = 1 Million)
- Select Reference: splits the input into a Reference Sequence Set (M = 100K) and an N - M Sequence Set (900K)
- Reference set: Pairwise Alignment & Distance Calculation -> Distance Matrix -> Multi-Dimensional Scaling (MDS) -> Reference Coordinates x, y, z
- N - M set: Interpolative MDS with Pairwise Distance Calculation -> N - M Coordinates x, y, z
- Both coordinate sets feed Visualization (3D Plot)
- Complexity labels on the diagram: O(MxM), O(MxM), O(Mx(N-1)); e.g. 25 Million
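The benefit of splitting the input into a reference set and an interpolated set can be checked with simple arithmetic. The Python sketch below counts pairwise distance evaluations only and ignores constant factors and the cost of MDS itself. It assumes that the interpolation step compares each of the N - M remaining sequences with all M reference sequences; the function name is hypothetical.

```python
def pairwise_costs(n, m):
    """Count distance evaluations for a full N x N run versus reference plus interpolation."""
    full = n * n                  # all-pairs distances over every sequence
    reference = m * m             # all-pairs distances within the reference set
    interpolation = m * (n - m)   # assumed: each remaining sequence vs. each reference
    return full, reference, interpolation


full, reference, interpolation = pairwise_costs(1_000_000, 100_000)
print(full)                                # 1000000000000
print(reference + interpolation)           # 100000000000
print(full / (reference + interpolation))  # 10.0
```

Under these assumptions, with N = 1 million and M = 100K the split needs about one tenth of the distance evaluations of the full calculation, and the interpolation part has no dependencies between the remaining sequences.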

Services and Support
- Web portal and metadata management
- CGB work
- // todo - Ryan

GTM vs. MDS

                      GTM                       MDS (SMACOF)
Purpose (both)        Non-linear dimension reduction; find an optimal configuration in a lower dimension; iterative optimization method
Input                 Vector-based data         Non-vector (pairwise similarity matrix)
Objective function    Maximize log-likelihood   Minimize STRESS or SSTRESS
Complexity            O(KN) (K << N)            O(N^2)
Optimization method   EM                        Iterative majorization (EM-like)
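For reference, the two MDS objective functions named in the comparison are usually written as follows. These are the standard textbook definitions and are assumed to be the ones the slide refers to. The symbols are not from the slide: \delta_{ij} is the input dissimilarity between items i and j, d_{ij}(X) is the Euclidean distance between their mapped points in the low-dimensional configuration X, and w_{ij} is a non-negative weight.

```latex
\mathrm{STRESS:}\quad \sigma(X) = \sum_{i<j\le N} w_{ij}\,\bigl(d_{ij}(X) - \delta_{ij}\bigr)^{2}

\mathrm{SSTRESS:}\quad \sigma^{2}(X) = \sum_{i<j\le N} w_{ij}\,\bigl(d_{ij}(X)^{2} - \delta_{ij}^{2}\bigr)^{2}
```

Both sums run over all pairs of items, which is where the O(N^2) complexity in the comparison comes from.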

PlotViz
[Architecture diagram. Labels: Visualization Algorithms; Chem2Bio2RDF; PlotViz; Parallel dimension reduction algorithms; Aggregated public databases (PubChem, CTD, DrugBank, QSAR); 3-D Map File; SPARQL query; Meta data; Light-weight client]

Education

SALSAHPC Dynamic Virtual Cluster on FutureGrid -- Demo at SC09
Demonstrate the concept of Science on Clouds on FutureGrid
[Dynamic Cluster Architecture diagram]
- Monitoring & Control Infrastructure: Monitoring Interface, Pub/Sub Broker Network, Summarizer, Switcher
- Virtual/Physical Clusters: SW-G using Hadoop on Linux bare-system; SW-G using Hadoop on Linux on Xen; SW-G using DryadLINQ on Windows Server 2008 bare-system
- XCAT Infrastructure
- iDataplex bare-metal nodes (32 nodes)

SALSAHPC Dynamic Virtual Cluster on FutureGrid -- Demo at SC09 Demonstrate the concept of Science on Clouds using a FutureGrid cluster