Interpolative Multidimensional Scaling Techniques for the Identification of Clusters in Very Large Sequence Sets April 27, 2011.

Slides:



Advertisements
Similar presentations
SALSASALSA Twister: A Runtime for Iterative MapReduce Jaliya Ekanayake Community Grids Laboratory, Digital Science Center Pervasive Technology Institute.
Advertisements

Scalable High Performance Dimension Reduction
SALSA HPC Group School of Informatics and Computing Indiana University.
SALSASALSASALSASALSA Applying Twister for Scientific Applications NSF Cloud PI Workshop March 17, 2011 Judy Qiu School of Informatics.
Twister4Azure Iterative MapReduce for Windows Azure Cloud Thilina Gunarathne Indiana University Iterative MapReduce for Azure Cloud.
Hybrid MapReduce Workflow Yang Ruan, Zhenhua Guo, Yuduo Zhou, Judy Qiu, Geoffrey Fox Indiana University, US.
High Performance Dimension Reduction and Visualization for Large High-dimensional Data Analysis Jong Youl Choi, Seung-Hee Bae, Judy Qiu, and Geoffrey Fox.
SALSASALSASALSASALSA Using MapReduce Technologies in Bioinformatics and Medical Informatics Computing for Systems and Computational Biology Workshop SC09.
SALSASALSASALSASALSA Using Cloud Technologies for Bioinformatics Applications MTAGS Workshop SC09 Portland Oregon November Judy Qiu
SALSASALSASALSASALSA Large Scale DNA Sequence Analysis and Biomedical Computing using MapReduce, MPI and Threading Workshop on Enabling Data-Intensive.
Authors: Thilina Gunarathne, Tak-Lon Wu, Judy Qiu, Geoffrey Fox Publish: HPDC'10, June 20–25, 2010, Chicago, Illinois, USA ACM Speaker: Jia Bao Lin.
Parallel Data Analysis from Multicore to Cloudy Grids Indiana University Geoffrey Fox, Xiaohong Qiu, Scott Beason, Seung-Hee.
MapReduce in the Clouds for Science CloudCom 2010 Nov 30 – Dec 3, 2010 Thilina Gunarathne, Tak-Lon Wu, Judy Qiu, Geoffrey Fox {tgunarat, taklwu,
Scalable Parallel Computing on Clouds Thilina Gunarathne Advisor : Prof.Geoffrey Fox Committee : Prof.Judy Qiu,
Dimension Reduction and Visualization of Large High-Dimensional Data via Interpolation Seung-Hee Bae, Jong Youl Choi, Judy Qiu, and Geoffrey Fox School.
SALSASALSASALSASALSA Digital Science Center June 25, 2010, IIT Geoffrey Fox Judy Qiu School.
SALSASALSA Programming Abstractions for Multicore Clouds eScience 2008 Conference Workshop on Abstractions for Distributed Applications and Systems December.
SALSASALSASALSASALSA High Performance Biomedical Applications Using Cloud Technologies HPC and Grid Computing in the Cloud Workshop (OGF27 ) October 13,
Iterative computation is a kernel function to many data mining and data analysis algorithms. Missing in current MapReduce frameworks is collective communication,
Panel Session The Challenges at the Interface of Life Sciences and Cyberinfrastructure and how should we tackle them? Chris Johnson, Geoffrey Fox, Shantenu.
Big Data Challenge Mega 10^6 Giga 10^9 Tera 10^12 Peta 10^15 Pig Latin.
Applying Twister to Scientific Applications CloudCom 2010 Indianapolis, Indiana, USA Nov 30 – Dec 3, 2010.
Science in Clouds SALSA Team salsaweb/salsa Community Grids Laboratory, Digital Science Center Pervasive Technology Institute Indiana University.
SALSASALSASALSASALSA Proposal Review Meeting with CTSI Translating Research Into Practice Project Development Team, July 8, 2009, IUPUI Gil Liu, Judy Qiu,
SALSASALSA Twister: A Runtime for Iterative MapReduce Jaliya Ekanayake Community Grids Laboratory, Digital Science Center Pervasive Technology Institute.
Generative Topographic Mapping in Life Science Jong Youl Choi School of Informatics and Computing Pervasive Technology Institute Indiana University
SALSASALSASALSASALSA Cloud Technologies and Their Applications March 26, 2010 Indiana University Bloomington Judy Qiu
Portable Parallel Programming on Cloud and HPC: Scientific Applications of Twister4Azure Thilina Gunarathne Bingjing Zhang, Tak-Lon.
Presenter: Yang Ruan Indiana University Bloomington
Yang Ruan PhD Candidate Computer Science Department Indiana University.
SALSASALSASALSASALSA Design Pattern for Scientific Applications in DryadLINQ CTP DataCloud-SC11 Hui Li Yang Ruan, Yuduo Zhou Judy Qiu, Geoffrey Fox.
Harp: Collective Communication on Hadoop Bingjing Zhang, Yang Ruan, Judy Qiu.
Parallel Applications And Tools For Cloud Computing Environments Azure MapReduce Large-scale PageRank with Twister Twister BLAST Thilina Gunarathne, Stephen.
S CALABLE H IGH P ERFORMANCE D IMENSION R EDUCTION Seung-Hee Bae.
SALSA HPC Group School of Informatics and Computing Indiana University.
A Hierarchical MapReduce Framework Yuan Luo and Beth Plale School of Informatics and Computing, Indiana University Data To Insight Center, Indiana University.
Community Grids Lab. Indiana University, Bloomington Seung-Hee Bae.
Multidimensional Scaling by Deterministic Annealing with Iterative Majorization Algorithm Seung-Hee Bae, Judy Qiu, and Geoffrey Fox SALSA group in Pervasive.
MPI and MapReduce CCGSC 2010 Flat Rock NC September Geoffrey Fox
Service Aggregated Linked Sequential Activities: GOALS: Increasing number of cores accompanied by continued data deluge Develop scalable parallel data.
Parallel Applications And Tools For Cloud Computing Environments SC 10 New Orleans, USA Nov 17, 2010.
SALSA Group’s Collaborations with Microsoft SALSA Group Principal Investigator Geoffrey Fox Project Lead Judy Qiu Scott Beason,
SALSASALSASALSASALSA Clouds Ball Aerospace March Geoffrey Fox
X-Informatics MapReduce February Geoffrey Fox Associate Dean for Research.
SALSA HPC Group School of Informatics and Computing Indiana University.
SCALABLE AND ROBUST DIMENSION REDUCTION AND CLUSTERING
Looking at Use Case 19, 20 Genomics 1st JTC 1 SGBD Meeting SDSC San Diego March Judy Qiu Shantenu Jha (Rutgers) Geoffrey Fox
Cloud Computing Paradigms for Pleasingly Parallel Biomedical Applications Thilina Gunarathne, Tak-Lon Wu Judy Qiu, Geoffrey Fox School of Informatics,
SALSA Group Research Activities April 27, Research Overview  MapReduce Runtime  Twister  Azure MapReduce  Dryad and Parallel Applications 
SALSASALSASALSASALSA Digital Science Center February 12, 2010, Bloomington Geoffrey Fox Judy Qiu
Parallel Applications And Tools For Cloud Computing Environments CloudCom 2010 Indianapolis, Indiana, USA Nov 30 – Dec 3, 2010.
Memcached Integration with Twister Saliya Ekanayake - Jerome Mitchell - Yiming Sun -
SALSASALSASALSASALSA Data Intensive Biomedical Computing Systems Statewide IT Conference October 1, 2009, Indianapolis Judy Qiu
SALSASALSA Dynamic Virtual Cluster provisioning via XCAT on iDataPlex Supports both stateful and stateless OS images iDataplex Bare-metal Nodes Linux Bare-
Yang Ruan PhD Candidate Salsahpc Group Community Grid Lab Indiana University.
SALSASALSASALSASALSA Data Intensive Biomedical Computing Systems Statewide IT Conference October 1, 2009, Indianapolis Judy Qiu
SALSA HPC Group School of Informatics and Computing Indiana University Workshop on Petascale Data Analytics: Challenges, and.
Implementation of Classifier Tool in Twister Magesh khanna Vadivelu Shivaraman Janakiraman.
Thilina Gunarathne, Bimalee Salpitkorala, Arun Chauhan, Geoffrey Fox
Integration of Clustering and Multidimensional Scaling to Determine Phylogenetic Trees as Spherical Phylograms Visualized in 3 Dimensions  Introduction.
Our Objectives Explore the applicability of Microsoft technologies to real world scientific domains with a focus on data intensive applications Expect.
Applying Twister to Scientific Applications
DACIDR for Gene Analysis
Overview Identify similarities present in biological sequences and present them in a comprehensible manner to the biologists Objective Capturing Similarity.
SC09 Doctoral Symposium, Portland, 11/18/2009
Agent-based Model Simulation with Twister
Scientific Data Analytics on Cloud and HPC Platforms
Adaptive Interpolation of Multidimensional Scaling
Towards High Performance Data Analytics with Java
Iterative and non-Iterative Computations
Presentation transcript:

Interpolative Multidimensional Scaling Techniques for the Identification of Clusters in Very Large Sequence Sets April 27, 2011

16S rRNA Sequence Diversity and Abundance 1 Pyrosequencing capable of generating millions of sequence reads from environmental samples Computational goal is classification and grouping of sequences * JE Clarridge, “Impact of 16S rRNA gene sequence analysis for identification of bacteria on clinical microbiology and infectious diseases.” Clin Microbiol Rev Oct;17(4) *

Sequence Clustering Gene Sequences Pairwise Alignment & Distance Calculation Distance Matrix Pairwise Clustering Multi- Dimensional Scaling Visualization Cluster Indices Coordinates 3D Plot Smith-Waterman / Needleman-Wunsch with Kimura2 / Jukes-Cantor / Percent-Identity MPI.NET Implementation Chi-Square / Deterministic Annealing C# Desktop Application based on VTK * Note. The implementations of Smith-Waterman and Needleman-Wunsch algorithms are from Microsoft Biology Foundation libraryMicrosoft Biology Foundation O(NxN)

It Works … …But … Visualization of MDS and clustering results for gene sequences from an environmental sample. The many different genes are classified by a clustering algorithm and visualized by MDS dimension reduction.

… There are limitations Distance calculation prohibitive for large N MDS prohibitive for large N Job management difficult for large N

MDS Interpolation m in-sample N-m out-of-sample N-m out-of-sample Training Interpolation Trained data Interpolated MDS Map Interpolated MDS Map O(MxM) O(Mx(N-M)) 4

Implementations support:  Splitting of data  Passing the output of map functions to reduce functions  Sorting the inputs to the reduce function based on the intermediate keys  Quality of services Map(Key, Value) Reduce(Key, List ) Data Partitions Reduce Outputs A hash function maps the results of the map tasks to r reduce tasks A Parallel Runtime From Information Retrieval Map Reduce 5

Twister Programming Model configureMaps(..) configureReduce(..) runMapReduce(..) while(condition){ } //end while updateCondition() close() Combine() operation Reduce() Map() Worker Nodes Communications/data transfers via the pub-sub broker network & direct TCP Iterations May send pairs directly Local Disk Cacheable map/reduce tasks  Main program may contain many MapReduce invocations or iterative MapReduce invocations Main program’s process space 6

Computational Advantages Dimensional scaling allows visual identification of cluster. Interpolative MDS greatly reduces computational complexity and memory requirements by utilizing pre-mapping results of in-sample subset.

Scaled-up Sequence Clustering Gene Sequences (N = 1 Million+) Distance Matrix Interpolative MDS with Pairwise Distance Calculation Multi- Dimensional Scaling (MDS) Visualization 3D Plot Reference Sequence Set (M = 100K) N - M Sequence Set (900K) Select Reference Reference Coordinates x, y, z N - M Coordinates x, y, z Pairwise Alignment & Distance Calculation O(MxM) O(MxM) O(Mx(N-M)) * Note. This implementation of the Needleman-Wunsch algorithm is based on the BioJava libraryBioJava

Results – 100K Sequences Full MDS MDS – 50K Interpolated MDS – 90K Interpolated

100K Metagenomics Sequences - Full MDS

100K Metagenomics Sequences – 50K Interpolated Points

100K Metagenomics Sequences – 90K Interpolated Points

Total Wallclock Time (hours) Number of Interpolated Points * For 100K sequences, running on 90 nodes (720 cores) of Polar Grid Quarry

Conclusions Mutlidimensional Scaling can be used to visually identify sequence clusters and direct more detailed studies. SMACOF-MDS requires only dissimilarity between sequences, not Euclidean distances or feature vectors. Interpolation can dramatically decrease computational complexity while yielding reasonable results. Optimal interpolation patterns need to be determined.

Conclusions Twister supports iterative algorithms (like MDS) and eases multi-thousand job control.

Future Directions Hierarchical MDS MDS by Deterministic Annealing MDS method comparison Scale-up to 20M+ sequences

Acknowledgements Salsa Group Dr. Geoffrey Fox Dr. Judy Qiu Seung-Hee Bae Jong Youl Choi Jaliya Ekanayake (Microsoft) Saliya Ekanayake Thilina Gunarathne Bingjing Zhang Hui Li Yang Ruan Yuduo Zhou Tak-Lon Wu Collaborators Mina Rho, Indiana University Qunfeng Donng, University of North Texas This work is supported by NIH ARRA funding.

Bibliography 1.Yijun Sun, et al. ESPRIT: estimating species richness using large collections of 16S rRNA pyrosequences. Nucl. Acids Res. (2009) first published online May 5, 2009 doi: /nar/gkp Xiaohong Qiu, Geoffrey C. Fox (presenter), Huapeng Yuan, Seung-Hee Bae, George Chrysanthakopoulos, Henrik Frystyk Nielsen PARALLEL CLUSTERING AND DIMENSIONAL SCALING ON MULTICORE SYSTEMS Invited talk at the 2008 High Performance Computing & Simulation Conference (HPCS 2008) In Conjunction With The 22nd EUROPEAN CONFERENCE ON MODELLING AND SIMULATION (ECMS 2008) Nicosia, Cyprus June 3 - 6, 2008.PARALLEL CLUSTERING AND DIMENSIONAL SCALING ON MULTICORE SYSTEMSHPCS Seung-Hee Bae Parallel Multidimensional Scaling Performance on Multicore Systems at workhop on Advances in High-Performance E-Science Middleware and Applications in Proceedings of eScience 2008 Indianapolis IN December Parallel Multidimensional Scaling Performance on Multicore SystemsworkhopeScience Seung-Hee Bae, Jong Youl Choi, Judy Qiu, Geoffrey Fox Dimension Reduction and Visualization of Large High-dimensional Data via InterpolationProceedings of ACM HPDC 2010 conference, Chicago, Illinois, June 20-25, 2010.Dimension Reduction and Visualization of Large High-dimensional Data via InterpolationHPDC 5.Jaliya Ekanayake, Hui Li, Bingjing Zhang, Thilina Gunarathne, Seung-Hee Bae, Judy Qiu, Geoffrey Fox Twister: A Runtime for Iterative MapReduce March Proceedings of the First International Workshop on MapReduce and its Applications of ACM HPDC 2010 conference, Chicago, Illinois, June 20-25, 2010.Twister: A Runtime for Iterative MapReduce Workshop HPDC 6.Jaliya Ekanayake Architecture and Performance of Runtime Environments for Data Intensive Scalable Computing Indiana University PhD Exam December Architecture and Performance of Runtime Environments for Data Intensive Scalable Computing

Potential Cost Savings (Sequence Length ~500)

Multidimensional Scaling (MDS)* o Given the proximity information among points. o Optimization problem to find mapping in target dimension of the given data based on pairwise proximity information while minimize the objective function. o Objective functions: STRESS (1) or SSTRESS (2) o Only needs pairwise distances  ij between original points (typically not Euclidean) o d ij (X) is Euclidean distance between mapped (3D) points * I. Borg and P. J. Groenen. Modern Multidimensional Scaling: Theory and Applications. Springer, New York, NY, U.S.A., 2005.