Convergence of Big Data and Extreme Computing

BDEC Birds of a Feather
Geoffrey Fox, November 16, 2016
gcf@indiana.edu
http://www.dsc.soic.indiana.edu/, http://spidal.org/, http://hpc-abds.org/kaleidoscope/
Department of Intelligent Systems Engineering, School of Informatics and Computing, Digital Science Center, Indiana University Bloomington
NSF funded through NSF14-43054 Datanet: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science

Using “Apache” (Commercial Big Data) Data Systems for Science
Pro: Use the rich functionality and usability of ABDS (Apache Big Data Stack)
Pro: Sustainability model of community open source
Con (a pro for many commercial users): Optimized for fault tolerance and usability rather than performance
Feature: Runs naturally on clouds rather than on HPC platforms
Feature: Cloud is logically centralized, physically distributed
Question: How do science data analysis requirements differ from commercial ones, e.g. the recommender systems heavily used commercially?
Approach: HPC-ABDS, using HPC runtime and tools to enhance commercial data systems (ABDS on top of HPC)

Harp (Hadoop Plugin) brings HPC to ABDS
Judy Qiu: Iterative HPC communication; scientific data abstractions
Careful support of distributed data AND distributed model
Avoids the parameter server approach; instead distributes the model over worker nodes and supports collective communication to bring the global model to each node
Have also added HPC to Apache Storm and Heron; working on adding a Parallel Computing Runtime to the distributed computing model built into Apache Spark, Flink, and Beam
[Figure: MapReduce model (M, shuffle, R) compared with MapCollective model (M, collective communication); layers shown: YARN, MapReduce V2, Harp, MapReduce applications, MapCollective applications]
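The MapCollective idea on the slide above can be illustrated with a minimal single-process sketch. This is not Harp's API (Harp is a Hadoop plugin with its own interfaces); the function names, the 1-D k-means example, and the simulated workers are all assumptions chosen for illustration. Each simulated worker keeps its own data partition, computes a partial model update locally, and a collective step gives every worker the same global model with no central parameter server.

```python
from typing import List, Tuple

# Hypothetical names throughout; a per-centroid (sum of points, count) list.
Partial = List[Tuple[float, int]]

def local_partial(points: List[float], centroids: List[float]) -> Partial:
    """Map step: assign each local point to its nearest centroid."""
    acc = [[0.0, 0] for _ in centroids]
    for x in points:
        k = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        acc[k][0] += x
        acc[k][1] += 1
    return [(s, n) for s, n in acc]

def allreduce(partials: List[Partial]) -> List[Partial]:
    """Collective step (simulated): every worker receives the element-wise sum."""
    total = [(sum(s for s, _ in col), sum(n for _, n in col))
             for col in zip(*partials)]
    return [list(total) for _ in partials]

def iterate(partitions: List[List[float]], centroids: List[float],
            rounds: int = 5) -> List[List[float]]:
    """Run a few k-means rounds; returns the model held by each worker."""
    models = [list(centroids) for _ in partitions]  # one model copy per worker
    for _ in range(rounds):
        partials = [local_partial(p, m) for p, m in zip(partitions, models)]
        combined = allreduce(partials)
        # Each worker updates its own copy; empty clusters keep the old centroid.
        models = [[s / n if n else c for (s, n), c in zip(tot, m)]
                  for tot, m in zip(combined, models)]
    return models

if __name__ == "__main__":
    print(iterate([[1.0, 2.0], [9.0, 10.0], [1.5, 9.5]], [0.0, 5.0]))
```

In this sketch the model is small enough to replicate on every worker; the slide's point about distributing the model itself over worker nodes would additionally require partitioning the centroid list, which is omitted here.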

HPC Runtime versus ABDS Distributed Computing Model on Data Analytics
Hadoop writes to disk and is slowest
Spark and Flink spawn many processes and do not support allreduce directly
MPI does in-place combined reduce/broadcast
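To make the "combined reduce/broadcast" remark concrete, the following sketch simulates two ways of giving every worker the sum of all workers' vectors. It runs in one process and uses no MPI library; the function names and the choice of recursive doubling are assumptions for illustration, and real MPI implementations select among several allreduce algorithms.

```python
from typing import List

def reduce_then_broadcast(local: List[List[float]]) -> List[List[float]]:
    """Two-phase pattern: sum all vectors at a root, then copy the result out."""
    total = [sum(col) for col in zip(*local)]
    return [list(total) for _ in local]

def allreduce_recursive_doubling(local: List[List[float]]) -> List[List[float]]:
    """Combined pattern: in each round, rank r exchanges buffers with rank
    r XOR step and adds them, so all ranks hold the sum after log2(P) rounds."""
    p = len(local)
    if p & (p - 1):
        raise ValueError("this sketch assumes a power-of-two worker count")
    bufs = [list(v) for v in local]
    step = 1
    while step < p:
        snapshot = [list(b) for b in bufs]  # buffers at the start of the round
        for rank in range(p):
            partner = rank ^ step
            bufs[rank] = [a + b for a, b in zip(snapshot[rank], snapshot[partner])]
        step *= 2
    return bufs

if __name__ == "__main__":
    data = [[1, 2], [3, 4], [5, 6], [7, 8]]
    print(reduce_then_broadcast(data))
    print(allreduce_recursive_doubling(data))
```

Both functions return the same result. The difference is in the communication pattern: the first funnels everything through one root, while the second has every worker exchange with one partner per round.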

Some Observations
Need an HPC project in the Apache Foundation
Need to distinguish data management from data analytics
  Management and search are I/O intensive and suitable for classic clouds
  Science data has fewer users than commercial data
  Analytics has many features in common with large scale simulations
Data analytics is often SPMD or BSP and benefits from high performance networking and communication libraries
Decompose Model (as in simulation) and Data (a bit different and confusing) across the nodes of a cluster
Big Data Ogres classify applications with 64 features derived from the NIST collection of use cases
  Overall application structure, e.g. pleasingly parallel
  Data features, e.g. from IoT, stored in HDFS, ...
  Processing features, e.g. uses neural nets or conjugate gradient
  Execution structure, e.g. data or model volume

Summary and Conclusions
This talk covers http://www.exascale.org/bdec/sites/www.exascale.org.bdec/files/whitepapers/bdec2016pathways-16Nov16-b.pdf
  3.4.4 Alternative 4: Logically Centralized Data (in the Cloud)
  3.4.5 Research Computing Moves to Big Data Stack
  5.1 Taxonomy of Application/workflow patterns and templates
Questions to answer: Use of both HPC and high-end data analytics
  Hardware platforms: Distinguish management and analytics; the storage and I/O model needs work; classic HPC is good for data analytics, while the many pleasingly parallel analytics can use clouds, HTC, etc.
  Software: Use HPC-ABDS for data and simulations
  Algorithms: Need to build high performance libraries for streaming and batch use so that scientists can move seamlessly between simulation and data analysis