51 Use Cases and Implications for HPC & Apache Big Data Stack: Architecture and Ogres

International Workshop on Extreme Scale Scientific Computing (Big Data and Extreme Computing, BDEC), Fukuoka, Japan, February 2014

Geoffrey Fox, Judy Qiu, Shantenu Jha (Rutgers)
School of Informatics and Computing, Digital Science Center, Indiana University Bloomington

51 Detailed Use Cases: contributed July-September 2013; each covers goals, data features such as the 3 V's, software, and hardware (Section 5).

- Government Operation (4): National Archives and Records Administration, Census Bureau
- Commercial (8): Finance in Cloud, Cloud Backup, Mendeley (citations), Netflix, Web Search, Digital Materials, Cargo Shipping (as in UPS)
- Defense (3): Sensors, Image Surveillance, Situation Assessment
- Healthcare and Life Sciences (10): Medical Records, Graph and Probabilistic Analysis, Pathology, Bioimaging, Genomics, Epidemiology, People Activity Models, Biodiversity
- Deep Learning and Social Media (6): Driving Car, Geolocating Images/Cameras, Twitter, Crowd Sourcing, Network Science, NIST Benchmark Datasets
- The Ecosystem for Research (4): Metadata, Collaboration, Language Translation, Light Source Experiments
- Astronomy and Physics (5): Sky Surveys (including comparison to simulation), Large Hadron Collider at CERN, Belle II Accelerator in Japan
- Earth, Environmental and Polar Science (10): Radar Scattering in the Atmosphere, Earthquake, Ocean, Earth Observation, Ice Sheet Radar Scattering, Earth Radar Mapping, Climate Simulation Datasets, Atmospheric Turbulence Identification, Subsurface Biogeochemistry (microbes to watersheds), AmeriFlux and FLUXNET Gas Sensors
- Energy (1): Smart Grid

26 features were recorded for each use case.

Enhanced Apache Big Data Stack (ABDS+): 114 capabilities
[Figure: layered HPC-ABDS software stack; green layers mark strong HPC integration opportunities]
The goal is the functionality of ABDS with the performance of HPC.

NIST Big Data Reference Architecture
[Diagram: a System Orchestrator coordinates a Big Data Application Provider (Collection, Curation, Analytics, Visualization, Access) sitting between a Data Provider and a Data Consumer along the information value chain. A Big Data Framework Provider supplies Processing Frameworks (analytic tools, etc.), Platforms (databases, etc.), and Infrastructures (physical and virtual resources for networking, computing, etc.), each horizontally scalable (e.g., VM clusters) and vertically scalable along the IT value chain. Management and Security & Privacy are cross-cutting fabrics. The key distinguishes SW service use, data flow, analytics tools, and transfer.]
NIST wants to implement selected use cases as patterns/Ogres.
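To make the roles concrete, here is a hypothetical mapping from the reference-architecture roles to example HPC-ABDS components, expressed as a small Python structure; the component choices are illustrative assumptions, not taken from the slides.

```python
# Hypothetical instantiation of the NIST Big Data Reference Architecture
# roles with example HPC-ABDS components (illustrative choices only).
nist_reference_architecture = {
    "System Orchestrator": ["Zookeeper", "Mesos"],
    "Data Provider": ["Internet of Things sensors", "HPC simulation output"],
    "Big Data Application Provider": {
        "Collection": ["Kafka"],
        "Curation": ["iRODS"],
        "Analytics": ["Mahout", "MPI-based libraries"],
        "Visualization": ["browser dashboards"],
        "Access": ["REST services"],
    },
    "Big Data Framework Provider": {
        "Processing Frameworks": ["Hadoop", "Spark", "Harp", "MPI"],
        "Platforms": ["HBase", "Hive"],
        "Infrastructures": ["VM clusters", "HPC clusters"],
    },
    "Data Consumer": ["analysts", "downstream applications"],
    "Cross-cutting fabrics": ["Management", "Security & Privacy"],
}

# Walking the structure shows which concrete component fills each role.
for role, components in nist_reference_architecture.items():
    print(role, "->", components)
```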

Benchmarking identical iterative computation under increasing communication requirements:
- Mahout on Hadoop MapReduce: slow, due to MapReduce overheads
- Python: slow, as a scripting language
- Spark: iterative MapReduce, but non-optimal communication
- Harp: a Hadoop plug-in with ~MPI collectives
- MPI: fastest, as HPC
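Below is a minimal sketch of the pattern being benchmarked, assuming mpi4py and NumPy are available: each iteration does identical local computation (K-means-style assignment of points to centers) followed by a global combine. In MPI, and in Harp's collectives for Hadoop, that combine is a single allreduce; in plain MapReduce it becomes a shuffle through disk on every iteration, which is the slowdown the comparison reports.

```python
# Sketch of iterative parallel K-means with MPI collectives
# (assumes mpi4py and NumPy; data is synthetic).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

k, dim, n_local = 4, 8, 10_000
rng = np.random.default_rng(seed=rank)
points = rng.standard_normal((n_local, dim))   # this rank's data partition

centers = np.zeros((k, dim))
if rank == 0:
    centers = points[:k].copy()                # initialize on one rank
comm.Bcast(centers, root=0)                    # broadcast collective

for _ in range(10):
    # identical local computation on every rank: assign points to centers
    dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assign = dists.argmin(axis=1)
    local_sum = np.zeros((k, dim))
    local_cnt = np.zeros(k)
    for c in range(k):
        mask = assign == c
        local_sum[c] = points[mask].sum(axis=0)
        local_cnt[c] = mask.sum()
    # the communication step: one allreduce instead of a MapReduce shuffle
    global_sum = np.empty_like(local_sum)
    global_cnt = np.empty_like(local_cnt)
    comm.Allreduce(local_sum, global_sum, op=MPI.SUM)
    comm.Allreduce(local_cnt, global_cnt, op=MPI.SUM)
    centers = global_sum / np.maximum(global_cnt, 1.0)[:, None]

if rank == 0:
    print("final centers:\n", centers)
```

Run it, for instance, with `mpiexec -n 4 python kmeans_sketch.py` (the filename is hypothetical). The point of the comparison is not the arithmetic, which is identical everywhere, but where each framework pays for the combine step.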

Big Data Ogres and Their Facets (from the 51 use cases)

The first Ogre facet captures the problem "architecture":
(i) Pleasingly Parallel, as in BLAST, protein docking, and imagery
(ii) Local Machine Learning: ML or filtering that is pleasingly parallel, as in bio-imagery and radar
(iii) Global Machine Learning, seen in LDA, clustering, etc., with parallel ML running over the nodes of the system (facets (i) and (iii) are contrasted in the sketch after this slide)
(iv) Fusion: knowledge discovery often involves fusion of multiple methods
(v) Workflow

The second Ogre facet captures the source of data: (i) SQL, (ii) NoSQL based, (iii) other enterprise data systems (10 examples at NIST), (iv) sets of files (as managed in iRODS), (v) Internet of Things, (vi) streaming, and (vii) HPC simulations.

Before data reaches the compute system there is often an initial data-gathering phase, characterized by a block size and timing; block size varies from a month (remote sensing, seismic) to a day (genomic) to seconds (real-time control, streaming).

There are storage/compute system styles: dedicated, permanent, and transient.

Other characteristics include the need for permanent auxiliary/comparison datasets; these can be interdisciplinary, implying nontrivial data movement/replication.
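To make facets (i) and (iii) concrete, here is a small sketch; the work functions are hypothetical stand-ins, not taken from the use cases. Pleasingly parallel work is a plain independent map with no inter-task communication, while global machine learning must combine contributions from all data partitions on every iteration, which is exactly the step that stresses communication.

```python
# Contrast of problem-architecture facets (i) and (iii); the work
# functions are illustrative stand-ins.
from multiprocessing import Pool

import numpy as np

def score_one(item):
    """Facet (i): one independent unit of work, e.g. one BLAST query."""
    return item ** 2  # stand-in for a per-item analysis

def local_gradient(partition, model):
    """Facet (iii): one partition's contribution to a global model update."""
    return (partition - model).mean()

if __name__ == "__main__":
    # Pleasingly Parallel: a plain map, no communication between tasks.
    with Pool(4) as pool:
        scores = pool.map(score_one, range(100))

    # Global Machine Learning: every iteration needs a global combine.
    partitions = [np.random.default_rng(i).normal(3.0, 1.0, 1000) for i in range(4)]
    model = 0.0
    for _ in range(20):
        grads = [local_gradient(p, model) for p in partitions]
        model += 0.5 * float(np.mean(grads))   # global reduction each step
    print(sum(scores), round(model, 3))        # model converges near 3.0
```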

Detailed Structure of Ogres

The third facet contains the Ogres themselves, classifying core analytics kernels/mini-applications: (i) Recommender Systems (collaborative filtering), (ii) SVM and linear classifiers (Bayes, random forests), (iii) Outlier Detection (iORCA), (iv) Clustering (many methods), (v) PageRank (sketched below), (vi) LDA (Latent Dirichlet Allocation), (vii) PLSI (Probabilistic Latent Semantic Indexing), (viii) SVD (Singular Value Decomposition), (ix) MDS (Multidimensional Scaling), (x) Graph Algorithms (seen in neural nets and searches of RDF triple stores), (xi) Learning Neural Networks (deep learning), (xii) Global Optimization (variational Bayes), (xiii) Agents, as in epidemiology (swarm approaches), and (xiv) GIS (Geographical Information Systems).

These core applications can be classified by features such as (a) flops per byte, (b) communication/interconnect requirements, (c) whether data points lie in metric or non-metric spaces, (d) Maximum Likelihood, (e) χ² minimizations, and (f) Expectation Maximization (often steepest descent).
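As one concrete instance of these kernels, here is a minimal NumPy sketch of (v) PageRank by power iteration; the toy graph and parameter values are illustrative assumptions. At scale this kernel is a sparse matrix-vector product, roughly two flops per stored edge against a dozen or more bytes of memory traffic, so it sits at the low end of feature (a), flops per byte, and is communication-bound.

```python
# Sketch of kernel (v), PageRank, as power iteration on a toy graph.
import numpy as np

def pagerank(adj, damping=0.85, tol=1e-10, max_iter=100):
    """Power iteration on the column-stochastic transition matrix."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=0)
    out_deg[out_deg == 0] = 1.0          # guard against dangling nodes
    transition = adj / out_deg           # column-normalize the link matrix
    rank = np.full(n, 1.0 / n)           # start from the uniform vector
    for _ in range(max_iter):
        new_rank = (1 - damping) / n + damping * (transition @ rank)
        if np.abs(new_rank - rank).sum() < tol:
            break
        rank = new_rank
    return rank

# Toy 4-page web: entry (i, j) = 1 means page j links to page i.
adj = np.array([[0, 0, 1, 0],
                [1, 0, 0, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
print(pagerank(adj))   # page 2, with the most in-links, ranks highest
```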

Lessons / Insights
- Please add to the set of 51 use cases.
- Integrate (don't compete) HPC with "commodity Big Data" (Google to Amazon to enterprise data analytics): improve Mahout rather than competing with it, and use Hadoop plug-ins rather than replacing Hadoop.
- The Enhanced Apache Big Data Stack (ABDS+) has 114 members; please improve it!
- Total data is ~6 zettabytes; the LHC holds ~0.0001 zettabytes (100 petabytes).
- HPC-ABDS+ integration areas include file systems, cluster resource management, file and object data management, interprocess and thread communication, analytics libraries, workflow, and monitoring.
- Ogres classify Big Data applications by three facets, each with several exemplars, giving a guide to the breadth and depth of Big Data. Does your architecture/software support all the Ogres?