Convergence of Big Data and Extreme Computing

Slides:



Advertisements
Similar presentations
Big Data Open Source Software and Projects ABDS in Summary XIV: Level 14B I590 Data Science Curriculum August Geoffrey Fox
Advertisements

Clouds from FutureGrid’s Perspective April Geoffrey Fox Director, Digital Science Center, Pervasive.
Indiana University Faculty Geoffrey Fox, David Crandall, Judy Qiu, Gregor von Laszewski Dibbs Research at Digital Science
HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC
Iterative computation is a kernel function to many data mining and data analysis algorithms. Missing in current MapReduce frameworks is collective communication,
Big Data and Clouds: Challenges and Opportunities NIST January Geoffrey Fox
Big Data Ogres and their Facets Geoffrey Fox, Judy Qiu, Shantenu Jha, Saliya Ekanayake Big Data Ogres are an attempt to characterize applications and algorithms.
Harp: Collective Communication on Hadoop Bingjing Zhang, Yang Ruan, Judy Qiu.
Big Data to Knowledge Panel SKG 2014 Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China August Geoffrey Fox
Directions in eScience Interoperability and Science Clouds June Interoperability in Action – Standards Implementation.
Big Data Open Source Software and Projects ABDS in Summary II: Layer 5 I590 Data Science Curriculum August Geoffrey Fox
SALSASALSA Large-Scale Data Analysis Applications Computer Vision Complex Networks Bioinformatics Deep Learning Data analysis plays an important role in.
Leverage Big Data With Hadoop Analytics Presentation by Ravi Namboori Visit
1 Panel on Merge or Split: Mutual Influence between Big Data and HPC Techniques IEEE International Workshop on High-Performance Big Data Computing In conjunction.
Hadoop Big Data Usability Tools and Methods. On the subject of massive data analytics, usability is simply as crucial as performance. Right here are three.
Geoffrey Fox Panel Talk: February
Big Data Analytics and HPC Platforms
Panel: Beyond Exascale Computing
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes for an HPC Enhanced Cloud and Fog Spanning IoT Big Data and Big Simulations.
Department of Intelligent Systems Engineering
Introduction to Distributed Platforms
Distributed Programming in “Big Data” Systems Pramod Bhatotia wp
Geoffrey Fox, Shantenu Jha, Dan Katz, Judy Qiu, Jon Weissman
Status and Challenges: January 2017
HPC Cloud Convergence February 2017 Software: MIDAS HPC-ABDS
Big Data, Simulations and HPC Convergence
NSF start October 1, 2014 Datanet: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science Indiana University.
Department of Intelligent Systems Engineering
Distinguishing Parallel and Distributed Computing Performance
Big Data Processing Issues taking care of Application Requirements, Hardware, HPC, Grid (distributed), Edge and Cloud Computing Geoffrey Fox, November.
Some Remarks for Cloud Forward Internet2 Workshop
NSF : CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science PI: Geoffrey C. Fox Software: MIDAS HPC-ABDS.
Department of Intelligent Systems Engineering
I590 Data Science Curriculum August
Applications SPIDAL MIDAS ABDS
High Performance Big Data Computing in the Digital Science Center
Data Science Curriculum March
AI-Driven Science and Engineering with the Global AI and Modeling Supercomputer GAIMSC Workshop on Clusters, Clouds, and Data for Scientific Computing.
Tutorial Overview February 2017
Department of Intelligent Systems Engineering
AI First High Performance Big Data Computing for Industry 4.0
Data Science for Life Sciences Research & the Public Good
Hilton Hotel Honolulu Tapa Ballroom 2 June 26, 2017 Geoffrey Fox
A Tale of Two Convergences: Applications and Computing Platforms
Martin Swany Gregor von Laszewski Thomas Sterling Clint Whaley
Distinguishing Parallel and Distributed Computing Performance
Research in Digital Science Center
CS110: Discussion about Spark
Scalable Parallel Interoperable Data Analytics Library
Cloud DIKW based on HPC-ABDS to integrate streaming and batch Big Data
Distinguishing Parallel and Distributed Computing Performance
Clouds from FutureGrid’s Perspective
HPC Cloud and Big Data Testbed
Discussion: Cloud Computing for an AI First Future
Digital Science Center III
Indiana University, Bloomington
Twister2: Design of a Big Data Toolkit
Department of Intelligent Systems Engineering
Digital Science Center
2 Programming Environment for Global AI and Modeling Supercomputer GAIMSC 2/19/2019.
$1M a year for 5 years; 7 institutions Active:
PHI Research in Digital Science Center
Panel on Research Challenges in Big Data
Cloud versus Cloud: How Will Cloud Computing Shape Our World?
Big Data, Simulations and HPC Convergence
Motivation Contemporary big data tools such as MapReduce and graph processing tools have fixed data abstraction and support a limited set of communication.
Geoffrey Fox High-Performance Big Data Computing: International, National, and Local initiatives COLLABORATORS China and IU: Fudan University, SICE, OVPR.
Research in Digital Science Center
Twister2 for BDEC2 Poznan, Poland Geoffrey Fox, May 15,
I590 Data Science Curriculum August
Presentation transcript:

Convergence of Big Data and Extreme Computing BDEC Birds of a Feather Geoffrey Fox November 16, 2016 gcf@indiana.edu http://www.dsc.soic.indiana.edu/, http://spidal.org/ http://hpc-abds.org/kaleidoscope/ Department of Intelligent Systems Engineering School of Informatics and Computing, Digital Science Center Indiana University Bloomington NSF Funded through NSF14-43054 Datanet: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science

Using “Apache” (Commercial Big Data) Data Systems for Science Pro: Use rich functionality and usability of ABDS (Apache Big Data Stack) Pro: Sustainability model of community open source Con (Pro for many commercial users): Optimized for fault-tolerance and usability and not performance Feature: Naturally run on clouds and not HPC platforms Feature: Cloud is logically centralized, physically distributed Question: how do science data analysis requirements differ from those commercially e.g. recommender systems heavily used commercially Approach: HPC-ABDS using HPC runtime and tools to enhance commercial data systems (ABDS on top of HPC) 11/7/2019

Harp (Hadoop Plugin) brings HPC to ABDS Judy Qiu: Iterative HPC communication; scientific data abstractions Careful support of distributed data AND distributed model Avoids parameter server approach but distributes model over worker nodes and supports collective communication to bring global model to each node Have also added HPC to Apache Storm and Heron; working on adding Parallel Computing Runtime to Distributed computing model built into Apache Spark, Flink, Beam Shuffle M Collective Communication R MapCollective Model MapReduce Model YARN MapReduce V2 Harp MapReduce Applications MapCollective Applications 11/7/2019

HPC Runtime versus ABDS distributed Computing Model on Data Analytics Hadoop writes to disk and is slowest; Spark and Flink spawn many processes and do not support allreduce directly; MPI does in-place combined reduce/broadcast 11/7/2019

Some Observations Need an HPC project in Apache Foundation Need to distinguish data management from data analytics Management and Search I/O intensive and suitable for classic clouds Science data has fewer users than commercial Analytics has many features in common with large scale simulations Data analytics often SPMD, BSP and benefits from high performance networking and communication libraries. Decompose Model (as in simulation) and Data (bit different and confusing across nodes of cluster Big Data Ogres classify applications with 64 features derived from NIST collection of use cases Overall application structure e.g. pleasingly parallel Data Features e.g. from IoT, stored in HDFS …. Processing Features e.g. uses neural nets or conjugate gradient Execution Structure e.g. data or model volume 11/7/2019

Summary and Conclusions This talk covers http://www.exascale.org/bdec/sites/www.exascale.org.bdec/files/whitepapers/bdec2016pathways-16Nov16-b.pdf 3.4.4 Alternative 4: Logically Centralized Data (in the Cloud) 3.4.5 Research Computing Moves to Big Data Stack 5.1 Taxonomy of Application/workflow patterns and templates Questions to answer: Use of both HPC and high-end data analytics hardware platforms: distinguish management and analytics; need work on storage and I/O model but classic HPC good for data analytics but the many pleasingly parallel analytics can use clouds/HTC etc. software: Use HPC-ABDS for data and simulations and algorithms: need to build high performance libraries for streaming and batch use so that scientists can move seamlessly between both simulation and data analysis. 11/7/2019