ABSTRACT
Iterative computation is at the kernel of many data mining and data analysis algorithms. Missing from current MapReduce frameworks is collective communication, an essential element of many iterative algorithms. We introduce the Harp library to improve expressiveness and performance in Big Data processing. Harp provides a common set of data abstractions and related collective communication abstractions that transform the Map-Reduce programming model into a Map-Collective model, thereby supporting the large collective operations that are a distinctive feature of data-intensive and data mining applications. Harp is an open source project from Indiana University that builds on our earlier work, Twister and Twister4Azure. We implemented Harp as a library that plugs into Hadoop, enabling users to run complex data analysis and machine learning algorithms on both clouds and supercomputers. The Scaling by Majorizing a Complicated Function (SMACOF) MDS algorithm is known to be fast and efficient, and DA-SMACOF can reduce the time cost and find global optima by using deterministic annealing. Its drawback is that it assumes all weights in the input distance matrices are equal to one. To remedy this, we added a weighting function to SMACOF, yielding WDA-SMACOF. Harp is the runtime platform for a newly started NSF-funded DIBBs project that will produce many more scalable parallel data analytics capabilities. This will enable the Globus Genomics pipeline to offer additional analytics through these new libraries with top performance; we can package our system as services that interface with Globus Genomics.
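For reference, the objective that these MDS variants minimize can be written in standard MDS notation (the symbols below are conventional, not taken from the poster). SMACOF and DA-SMACOF fix every weight to one, while WDA-SMACOF retains general weights:

```latex
\sigma(X) \;=\; \sum_{i < j \le N} w_{ij}\,\bigl(d_{ij}(X) - \delta_{ij}\bigr)^{2}
```

where \(\delta_{ij}\) is the observed dissimilarity between points \(i\) and \(j\), \(d_{ij}(X)\) is their Euclidean distance in the low-dimensional embedding \(X\), and \(w_{ij} \ge 0\) is the weight (\(w_{ij} = 1\) for all pairs in SMACOF and DA-SMACOF).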
Implementing High Performance Computing with the Apache Big Data Stack: Experience with Harp
Judy Qiu, Bingjing Zhang, Thomas Wiggins, Indiana University

BACKGROUND
The Harp plugin currently plugs into Hadoop. The Harp architecture extends next-generation MapReduce frameworks built on the YARN resource manager, adding support for Map-Collective applications (see figures). We built Map-Collective as a unified model to improve the performance and expressiveness of big data tools.

EXPERIMENT RESULTS
We ran Harp with K-means, Graph Layout, and Multidimensional Scaling algorithms on realistic application datasets over 4,096 cores of the IU Big Red II supercomputer (Cray/Gemini), where we achieved linear speedup.

CONCLUSIONS
Harp demonstrates the portability of HPC-ABDS to HPC and eventually exascale systems. With this plug-in, Map-Reduce jobs can be transformed into Map-Collective jobs. For the first time, Map-Collective brings high performance to the Apache Big Data Stack through a clear communication abstraction that did not previously exist in the Hadoop ecosystem. We expect Harp to equal MPI performance with straightforward optimizations.

REFERENCES
[1] J. Qiu, S. Jha, A. Luckow, G. Fox. "Towards HPC-ABDS: An Initial High-Performance Big Data Stack," accepted to the proceedings of the ACM 1st Big Data Interoperability Framework Workshop: Building Robust Big Data Ecosystem, NIST Special Publication, March 13-21.
[2] B. Zhang, Y. Ruan, J. Qiu. "Harp: Collective Communication on Hadoop," Proceedings of the IEEE International Conference on Cloud Engineering (IC2E 2015).
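The transformation of a Map-Reduce job into a Map-Collective job hinges on an allreduce-style step among the mappers. The sketch below is a minimal in-process illustration of that communication pattern, not Harp's actual API: each "mapper" (here a thread) publishes a local partial vector, and after the allreduce every mapper holds the identical element-wise global sum. All class and method names are invented for illustration.

```java
import java.util.Arrays;
import java.util.concurrent.CyclicBarrier;

/** Minimal allreduce sketch among in-process "mappers" (threads). */
public class AllReduceSketch {

    /** Each mapper of rank r contributes the vector {r+1, (r+1)*10};
     *  the allreduce leaves every mapper with the element-wise sum. */
    public static double[] run(int mappers) throws Exception {
        final double[][] local  = new double[mappers][2]; // published partials
        final double[][] result = new double[mappers][];  // per-mapper outcome
        final CyclicBarrier barrier = new CyclicBarrier(mappers);

        Thread[] ts = new Thread[mappers];
        for (int r = 0; r < mappers; r++) {
            final int rank = r;
            ts[r] = new Thread(() -> {
                try {
                    local[rank][0] = rank + 1;          // local partial result
                    local[rank][1] = (rank + 1) * 10;
                    barrier.await();                    // all partials published
                    double[] sum = new double[2];       // every mapper reduces
                    for (double[] part : local)
                        for (int i = 0; i < 2; i++) sum[i] += part[i];
                    result[rank] = sum;                 // same value on every rank
                } catch (Exception e) { throw new RuntimeException(e); }
            });
            ts[r].start();
        }
        for (Thread t : ts) t.join();

        // Sanity check: allreduce must leave identical data on every mapper.
        for (int r = 1; r < mappers; r++)
            if (!Arrays.equals(result[r], result[0]))
                throw new IllegalStateException("ranks disagree after allreduce");
        return result[0];
    }

    public static void main(String[] args) throws Exception {
        // With 4 mappers: 1+2+3+4 = 10 and 10+20+30+40 = 100 on every mapper.
        System.out.println(Arrays.toString(run(4)));
    }
}
```

In a real Harp job the mappers run in separate JVMs across the cluster and the reduction is performed by the library over the network; the barrier-plus-sum above only models the semantics.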
The collective patterns used by each algorithm (from the poster figures) are:
- K-means Clustering: the mappers allreduce the centroids.
- Force-directed Graph Drawing: the mappers allgather the positions of vertices.
- WDA-SMACOF: the mappers allreduce the stress value, and allgather and allreduce results in the conjugate gradient process.

HIGH PERFORMANCE DATA ANALYTICS
As data grows in both volume and complexity, a runtime environment needs to integrate with community infrastructure that supports interoperable, sustainable, and high-performance data analytics. One solution is to converge the Apache Big Data Stack with high-performance cyberinfrastructure (HPC-ABDS) into well-defined and implemented common building blocks, providing richness in capabilities and productivity. HPC-ABDS aims to provide these in library form, so that they can be reused by higher-level applications and tuned for specific domain problems such as machine learning.
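The K-means pattern listed above (allreduce of centroids) can be sketched as one iteration in plain Java. Here the "allreduce" is modeled as a simple in-process accumulation of per-mapper centroid sums and counts; the data, class name, and method names are invented for illustration and do not come from Harp.

```java
import java.util.Arrays;

/** One K-means iteration in the Map-Collective style (1-D points for brevity). */
public class KMeansCollective {

    /** Nearest-centroid assignment plus local sums/counts for one mapper's points. */
    static void localStep(double[] points, double[] centroids,
                          double[] sums, int[] counts) {
        for (double p : points) {
            int best = 0;
            for (int c = 1; c < centroids.length; c++)
                if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best]))
                    best = c;
            sums[best] += p;
            counts[best]++;
        }
    }

    /** One iteration over the partitions (one array per mapper). */
    static double[] iterate(double[][] partitions, double[] centroids) {
        int k = centroids.length;
        double[] globalSums = new double[k];
        int[] globalCounts = new int[k];
        for (double[] part : partitions) {   // each mapper works in parallel in reality
            double[] sums = new double[k];
            int[] counts = new int[k];
            localStep(part, centroids, sums, counts);
            for (int c = 0; c < k; c++) {    // the "allreduce" of sums and counts
                globalSums[c] += sums[c];
                globalCounts[c] += counts[c];
            }
        }
        double[] next = new double[k];       // every mapper computes identical centroids
        for (int c = 0; c < k; c++)
            next[c] = globalCounts[c] == 0 ? centroids[c]
                                           : globalSums[c] / globalCounts[c];
        return next;
    }

    public static void main(String[] args) {
        double[][] partitions = { {0.0, 1.0, 9.0}, {2.0, 10.0, 11.0} };
        double[] centroids = { 0.0, 10.0 };
        // Cluster 0 gathers {0,1,2} -> 1.0; cluster 1 gathers {9,10,11} -> 10.0.
        System.out.println(Arrays.toString(iterate(partitions, centroids)));
    }
}
```

Because the allreduce leaves identical sums and counts on every mapper, each mapper can recompute the new centroids locally, which is what removes the single-reducer bottleneck of plain MapReduce K-means.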