Towards High Performance Data Analytics with Java

Presentation transcript:

Towards High Performance Data Analytics with Java
Saliya Ekanayake, sekanaya@cs.indiana.edu
4/1/2013 SALSA Presentation

A Bit of Background
Gene Sequence Clustering and Visualization

Projects
  Million sequence project: http://salsahpc.indiana.edu/millionseq/
  Work on COG (protein) sequences: http://salsacog.blogspot.com/
  Work on phylogenetic trees: http://salsafungiphy.blogspot.com/

Publications
  G. L. H. Yang Ruan, Saliya Ekanayake, Ursel Schütte, James D. Bever, Haixu Tang, Geoffrey Fox, “Integration of Clustering and Multidimensional Scaling to Determine Phylogenetic Trees as Spherical Phylograms Visualized in 3 Dimensions,” in C4Bio 2014 of IEEE/ACM CCGrid 2014, Chicago, USA, 2014.
  L. Stanberry, R. Higdon, W. Haynes, N. Kolker, W. Broomall, S. Ekanayake, A. Hughes, Y. Ruan, J. Qiu, E. Kolker, and G. Fox, “Visualizing the protein sequence universe,” in Proceedings of the 3rd International Workshop on Emerging Computational Methods for the Life Sciences, Delft, The Netherlands, 2012, pp. 13-22.
  Y. Ruan, S. Ekanayake, M. Rho, H. Tang, S.-H. Bae, J. Qiu, and G. Fox, “DACIDR: deterministic annealed clustering with interpolative dimension reduction using a large collection of 16S rRNA sequences,” in Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine, Orlando, Florida, 2012, pp. 329-336.
  A. Hughes, Y. Ruan, S. Ekanayake, S. H. Bae, Q. Dong, M. Rho, J. Qiu, and G. Fox, “Interpolative multidimensional scaling techniques for the identification of clusters in very large sequence sets,” BMC Bioinformatics, vol. 13, Suppl 2, p. S9, 2012.

Determine, Represent, and Verify Clusters
  Gene sequences (input):
    >G0H13NN01D34CL
    GTCGTTTAAGCCATTACGTC …
    >G0H13NN01DK2OZ
    GTCGTTAAGCCATTACGTC …
  Determine clusters: table with columns "Sequence" and "Cluster" (surviving values: 2, 1, …)
  Represent and verify: visualize in 3D
  Generate phylogenetic trees: compared to traditional 2D
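
The gene sequences on this slide appear in FASTA-style form: a ">" header line followed by the bases. As a minimal sketch of reading such input, assuming that format, the following Java maps each identifier to its sequence. The class and method names are hypothetical, and the sample data uses only the sequence prefixes visible on the slide (the originals are truncated with "…").

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: reads FASTA-style text into an id -> sequence map. */
class FastaSketch {

    static Map<String, String> parse(String fasta) {
        Map<String, StringBuilder> acc = new LinkedHashMap<>();
        StringBuilder current = null;
        for (String raw : fasta.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;                       // skip blank lines
            }
            if (line.startsWith(">")) {         // header line starts a new record
                current = new StringBuilder();
                acc.put(line.substring(1), current);
            } else if (current != null) {       // sequence lines are concatenated
                current.append(line);
            }
        }
        Map<String, String> out = new LinkedHashMap<>();
        acc.forEach((id, seq) -> out.put(id, seq.toString()));
        return out;
    }

    public static void main(String[] args) {
        // Only the prefixes shown on the slide; the real sequences are longer.
        String sample = ">G0H13NN01D34CL\nGTCGTTTAAGCCATTACGTC\n"
                      + ">G0H13NN01DK2OZ\nGTCGTTAAGCCATTACGTC\n";
        parse(sample).forEach((id, seq) ->
                System.out.println(id + " length=" + seq.length()));
    }
}
```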

Under the Hood

Algorithms
  Alignment and Distance Calculation
    SALSA-SWG → C# MPI
    SALSA-SWG-MBF → C# MPI
    SALSA-NW-MBF → C# MPI
    SALSA-SWG-MBF2Java → Java MapReduce
    SALSA-NW-BioJava → Java MapReduce
  Dimension Reduction
    MDSasChisq → C# MPI
    DA-SMACOF → C# MPI
    Twister DA-SMACOF → Java Iterative MapReduce
    WDA-SMACOF → Java Iterative MapReduce
  Clustering
    DAPWC → C# MPI
    DAVS → C# MPI

[Diagram: pipeline from gene sequences (>G0H13NN01D34CL GTCGTTTAAGCCATTACGTC …, >G0H13NN01DK2OZ GTCGTTAAGCCATTACGTC …) through Alignment and Distance Calculation, Dimension Reduction, Clustering, and Visualization, connected by data sets D1 to D5. Sample outputs: a cluster table with columns "#" and "Cluster" (surviving values: 1, 3) and the coordinate table below.]

  #   X      Y      Z
  0   0.358  0.262  0.295
  1   0.252  0.422  0.372

Reality Is More Complex
  Study of Biological Sequence Structure: http://salsahpc.blogspot.com/2013/05/study-of-biological-sequence-structure.html
  Million Sequence Processes: http://salsahpc.indiana.edu/millionseq/fungi2/fungi2_index.html

Runs On
  Tempest → Windows HPC cluster
  FutureGrid, BigRed II, Quarry → traditional Linux-based HPC clusters
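
The first pipeline stage turns a set of sequences into pairwise distances that dimension reduction and clustering then consume. The sketch below illustrates only the shape of that output. It deliberately swaps in plain Levenshtein edit distance, normalized by the longer sequence length, as a stand-in; it is not the SALSA-SWG or SALSA-NW alignment code, and all names are hypothetical.

```java
/** Hypothetical stand-in for the alignment and distance stage. */
class DistanceSketch {

    /** Levenshtein edit distance with unit costs, using two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;                                   // distance from empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitute = prev[j - 1]
                        + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int delete = prev[j] + 1;
                int insert = curr[j - 1] + 1;
                curr[j] = Math.min(substitute, Math.min(delete, insert));
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }

    /** Symmetric matrix of edit distances normalized to [0, 1]. */
    static double[][] distanceMatrix(String[] seqs) {
        int n = seqs.length;
        double[][] d = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                int longest = Math.max(1, Math.max(seqs[i].length(), seqs[j].length()));
                d[i][j] = (double) editDistance(seqs[i], seqs[j]) / longest;
                d[j][i] = d[i][j];
            }
        }
        return d;
    }

    public static void main(String[] args) {
        // Sequence prefixes as shown on the slide.
        String[] seqs = {"GTCGTTTAAGCCATTACGTC", "GTCGTTAAGCCATTACGTC"};
        double[][] d = distanceMatrix(seqs);
        System.out.println("d[0][1] = " + d[0][1]);
    }
}
```

The matrix is filled once per unordered pair, which is also why this stage parallelizes naturally: disjoint blocks of pairs can be computed independently.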

Towards Java

Motivation
  Immediate → limited Windows HPC clusters
  Future → integrate with Apache Big Data Stack (ABDS)

Options
  Keep C#
    Run on Azure cloud → not the best for MPI because of high latencies and low bandwidths
    Run on Mono → we tried it and it worked, but performance was poor
  Convert to Java
    Time consuming, but gave good results

“Java Ready” Applications
  Deterministic Annealing Vector Sponge (DAVS)
  Deterministic Annealing Pairwise Clustering (DAPWC)

Evaluations

MPI Frameworks
  MPI.NET → a high-performance message passing interface for the .NET environment
  FastMPJ → a pure Java implementation of the mpiJava 1.2 specification
  OpenMPI → Java wrapper for the native MPI implementation
    Nightly snapshot 1.9a1r28881 (OMPI-nightly): conforms to the mpiJava 1.2 specification
    Source tree revision 30301 (OMPI-trunk)
    Release candidate version 1.7.5rc5 (OMPI-175rc5): latest of the three

Kernel Benchmarks → Your code was here!!
  Ohio MicroBenchmark (OMB) Suite
    Send and receive
    Allreduce

Application Benchmarks
  DAVS and DAPWC on real data
  Parallel patterns of T x P x N
    T: number of threads per process
    P: number of MPI processes per node
    N: number of nodes
  Threads from Habanero Java Library
    Mainly for parallel loops → Your code was here!!
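
The T x P x N notation can be made concrete with a little arithmetic: total parallelism is the product T * P * N, and for a node with a fixed core count the candidate patterns are the factor pairs of that count. The sketch below is illustrative only; the class name, the 8-core node, and the 4-node cluster are assumptions, not values from the slides.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for reasoning about T x P x N parallel patterns. */
class ParallelPatterns {

    /** Total parallelism: threads per process * processes per node * nodes. */
    static int totalParallelism(int t, int p, int n) {
        return t * p * n;
    }

    /** All {T, P} pairs with T * P == coresPerNode (one thread per core). */
    static List<int[]> patternsForNode(int coresPerNode) {
        List<int[]> patterns = new ArrayList<>();
        for (int t = 1; t <= coresPerNode; t++) {
            if (coresPerNode % t == 0) {
                patterns.add(new int[] {t, coresPerNode / t});
            }
        }
        return patterns;
    }

    public static void main(String[] args) {
        int coresPerNode = 8;   // assumed for illustration
        int nodes = 4;          // assumed for illustration
        for (int[] tp : patternsForNode(coresPerNode)) {
            System.out.println(tp[0] + "x" + tp[1] + "x" + nodes
                    + " -> parallelism " + totalParallelism(tp[0], tp[1], nodes));
        }
    }
}
```

Every pattern listed for a given node has the same total parallelism, so comparing them isolates the cost of threads versus MPI processes.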

Kernel Benchmarks: MPI Send and Receive
  [Figure: performance with different MPI frameworks]
  [Figure: OMPI-trunk performance with and without Infiniband]

Kernel Benchmarks: MPI Allreduce
  [Figure: performance with different MPI frameworks]
  [Figure: OMPI-trunk performance with and without Infiniband]

DAVS Performance
Mode: Charge5
  [Figure panels: Pure MPI; MPI with Threads; Pure MPI Speedup]

DAVS Performance
Mode: Charge2
  [Figure panels: Pure MPI; MPI with Threads; Pure MPI Speedup]

DAVS Performance
Single Node: Charge 2, Charge 5, and Charge 6
  [Figure panels: Charge 2; Charge 5; Charge 6]

Points
  OMPI-trunk performed the best, and OMPI-nightly was close behind
  MPI.NET may be suffering from poor Infiniband performance
  FastMPJ had issues that prevented it from running the applications
  Performance with threading falls short of expectations for Java

DAPWC Performance
OMPI-175 only (chosen over OMPI-trunk)

DAPWC Performance
Parallelism ≥ 16

DAPWC Performance: Speedup

Points
  Performance with threads is better than in DAVS, but the Tx1xN pattern behaves peculiarly
  FastMPJ failed as before
  MPI.NET and OMPI-nightly runs are yet to be performed
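
The speedup figures on these slides did not survive in the transcript, so as a reminder of what such charts plot, here is a minimal sketch that assumes the usual definitions: speedup is the base run time divided by the run time at higher parallelism, and parallel efficiency is speedup divided by the increase in parallelism. The timings in `main` are hypothetical placeholders, not measurements from the talk.

```java
/** Hypothetical helper: speedup and parallel efficiency relative to a base run. */
class SpeedupSketch {

    /** Speedup = base run time / run time at higher parallelism. */
    static double speedup(double baseSeconds, double seconds) {
        return baseSeconds / seconds;
    }

    /** Efficiency = speedup scaled by the ratio of base to current parallelism. */
    static double efficiency(double baseSeconds, double seconds,
                             int baseParallelism, int parallelism) {
        return speedup(baseSeconds, seconds) * baseParallelism / parallelism;
    }

    public static void main(String[] args) {
        // Hypothetical timings for illustration only.
        int[] parallelism = {16, 32, 64};
        double[] seconds = {400.0, 210.0, 120.0};
        for (int i = 0; i < parallelism.length; i++) {
            System.out.println("parallelism " + parallelism[i]
                    + ": speedup " + speedup(seconds[0], seconds[i])
                    + ", efficiency " + efficiency(seconds[0], seconds[i],
                                                   parallelism[0], parallelism[i]));
        }
    }
}
```

Using the smallest measured run as the base, rather than a serial run, is a common choice when a serial run is impractical for large data.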

Current Tasks and Future
  Complete migration of applications to Java
  Evaluate performance
  Investigate “not so great” thread performance

Future
  How to integrate with ABDS?
  Provide SaaS?

Thank you!