Big Data, Simulations and HPC Convergence

Presentation transcript:

Big Data, Simulations and HPC Convergence
BDEC: Big Data and Extreme-scale Computing, June 15-17, 2016, Frankfurt
http://www.exascale.org/bdec/meeting/frankfurt
Geoffrey Fox, Judy Qiu, Shantenu Jha, Saliya Ekanayake, Supun Kamburugamuve
June 16, 2016, gcf@indiana.edu
http://www.dsc.soic.indiana.edu/, http://spidal.org/, http://hpc-abds.org/kaleidoscope/
Department of Intelligent Systems Engineering, School of Informatics and Computing, Digital Science Center, Indiana University Bloomington

Components in Big Data HPC Convergence

Applications, Benchmarks and Libraries
- 51 NIST Big Data Use Cases, 7 Computational Giants of the NRC Massive Data Analysis report, 13 Berkeley dwarfs, 7 NAS parallel benchmarks
- Unified discussion by treating data and model separately for each application; 64 facets, the Convergence Diamonds, characterize applications (see the sketch after this slide)
- "Pleasingly parallel" or "streaming" apply separately to data and model; an O(N^2) algorithm is relevant to the model of a big data or big simulation problem; "Lustre v. HDFS" describes just the data; "Volume" can be large or small separately for data and model
- This characterization identifies the hardware and software features needed by each application across big data and simulation, yielding a "complete" set of benchmarks (NIST)

Software Architecture and its Implementation
- HPC-ABDS: Cloud-HPC interoperable software combining the performance of HPC (High Performance Computing) with the rich functionality of the Apache Big Data Stack
- Added HPC to Hadoop, Storm, Heron, Spark; will add to Beam and Flink; work in the Apache model, contributing code
- Run the same HPC-ABDS across all platforms, but "data management" nodes have a different balance of I/O, network and compute from "model" nodes
- Optimize for the data and model functions as specified by the Convergence Diamonds; do not optimize separately for simulation and big data
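To make the data/model separation concrete, here is a minimal sketch of how an application might be tagged with a few Convergence Diamond style facets. The class and facet names (ApplicationDiamond, DataFacet, ModelFacet, needsHpcStyleModelNodes) are illustrative assumptions, not part of any SPIDAL or NIST artifact.

```java
// Illustrative sketch only: facet names are hypothetical, loosely modeled on the
// "64 facets" idea of characterizing data and model separately for each application.
import java.util.EnumSet;

enum DataFacet { STREAMING, PLEASINGLY_PARALLEL, HDFS_STYLE, LUSTRE_STYLE, LARGE_VOLUME }
enum ModelFacet { ITERATIVE, O_N_SQUARED, PLEASINGLY_PARALLEL, LARGE_VOLUME }

// One record per application: data and model are characterized independently.
record ApplicationDiamond(String name,
                          EnumSet<DataFacet> data,
                          EnumSet<ModelFacet> model) {

    // Example heuristic: "model" nodes want compute and network, "data" nodes want I/O.
    boolean needsHpcStyleModelNodes() {
        return model.contains(ModelFacet.ITERATIVE) || model.contains(ModelFacet.O_N_SQUARED);
    }
}

class DiamondDemo {
    public static void main(String[] args) {
        ApplicationDiamond clustering = new ApplicationDiamond(
                "trajectory clustering",
                EnumSet.of(DataFacet.HDFS_STYLE, DataFacet.LARGE_VOLUME),
                EnumSet.of(ModelFacet.ITERATIVE, ModelFacet.O_N_SQUARED));
        System.out.println(clustering.name() + " needs HPC-style model nodes: "
                + clustering.needsHpcStyleModelNodes());
    }
}
```

The point of the sketch is only that the same application carries two independent facet sets, so hardware and software choices can be driven by the model facets even when the data facets look like a conventional big data workload.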

64 Features in 4 Views for Unified Classification of Big Data and Simulation Applications
[Figure: the 64 facets arranged in four views, each facet labeled by whether it applies to simulations, to data analytics (model for data), or to both; nearly all facets describe some combination of data and model.]

HPC-ABDS

HPC-ABDS Activities of NSF14-43054
- Level 17: Orchestration: Apache Beam (Google Cloud Dataflow)
- Level 16: Applications: Data mining for molecular dynamics, image processing for remote sensing and pathology, graphs, streaming, bioinformatics, social media, financial informatics, text mining
- Level 16: Algorithms: Generic and application specific; SPIDAL Library
- Level 14: Programming: Storm, Heron (Twitter's replacement for Storm), Hadoop, Spark, Flink; improve inter- and intra-node performance; science data structures
- Level 13: Runtime Communication: Enhanced Storm and Hadoop (Spark, Flink, Giraph) using HPC runtime technologies, Harp (see the sketch after this list)
- Level 11: Data Management: Hbase and MongoDB integrated via Beam and other Apache tools; enhance Hbase
- Level 9: Cluster Management: Integrate Pilot Jobs with Yarn, Mesos, Spark, Hadoop; integrate Storm and Heron with Slurm
- Level 6: DevOps: Python Cloudmesh virtual cluster interoperability
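As a rough illustration of the Level 13 idea, the sketch below shows the kind of allreduce collective that HPC runtimes (and Harp-style plugins) supply to map-centric frameworks: every worker contributes a partial result and every worker then sees the combined value. It is a plain-threads, single-node toy and assumes nothing about Harp's or MPI's actual APIs.

```java
// Toy allreduce among worker threads: each worker contributes a partial sum and
// all workers then read the same global total -- the collective communication
// pattern missing from plain MapReduce. Not Harp's or MPI's real API.
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.DoubleAdder;

class ToyAllReduce {
    public static void main(String[] args) throws InterruptedException {
        int workers = 4;
        double[] partials = {1.0, 2.0, 3.0, 4.0};   // one partial result per worker
        DoubleAdder global = new DoubleAdder();      // shared accumulator ("reduce" target)
        CyclicBarrier barrier = new CyclicBarrier(workers);

        Thread[] threads = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            final int id = w;
            threads[w] = new Thread(() -> {
                global.add(partials[id]);            // reduce phase: contribute partial
                try {
                    barrier.await();                 // wait until every worker has contributed
                } catch (InterruptedException | BrokenBarrierException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                // broadcast phase: every worker now reads the same global value
                System.out.println("worker " + id + " sees total " + global.sum());
            });
            threads[w].start();
        }
        for (Thread t : threads) t.join();
    }
}
```

In the real stack the shared accumulator is replaced by an MPI-grade allreduce across nodes; closing that gap is roughly what Harp and the enhanced Storm/Hadoop runtimes target.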

Convergence Language: Recreating Java Grande
Benchmarks on 128 24-core Haswell nodes running SPIDAL data analytics. The best Java is a factor of 10 faster than "out of the box" Java and comparable to C++.
[Plot: speedup compared to 1 process per node on 48 nodes; curves compare (i) best threads intra-node with MPI inter-node, (ii) best MPI both inter- and intra-node, and (iii) MPI inter/intra-node with Java not optimized.]
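The "factor of 10" typically comes from avoiding boxed objects and garbage in inner loops and from explicit intra-node threading. The sketch below is a generic example of that style (primitive arrays plus a fixed thread pool), not the actual SPIDAL benchmark code; inter-node MPI is omitted.

```java
// Generic example of the "HPC-style Java" pattern behind the Java Grande results:
// primitive double[] arrays, no autoboxing in the hot loop, explicit intra-node threads.
// Not the SPIDAL benchmark itself; inter-node MPI is left out.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class IntraNodeSum {
    public static void main(String[] args) throws InterruptedException, ExecutionException {
        int n = 10_000_000;
        double[] x = new double[n];
        for (int i = 0; i < n; i++) x[i] = i * 0.001;

        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Double>> parts = new ArrayList<>();

        int chunk = (n + threads - 1) / threads;
        for (int t = 0; t < threads; t++) {
            final int lo = t * chunk;
            final int hi = Math.min(n, lo + chunk);
            parts.add(pool.submit(() -> {
                double s = 0.0;                  // primitive accumulator: no garbage per element
                for (int i = lo; i < hi; i++) s += x[i] * x[i];
                return s;                        // boxed once per thread, not once per element
            }));
        }

        double total = 0.0;
        for (Future<Double> f : parts) total += f.get();
        pool.shutdown();
        System.out.println("sum of squares = " + total);
        // In the full setup, each node runs a loop like this and the per-node totals
        // are combined inter-node with an MPI-style allreduce.
    }
}
```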