Theme 4: High-Performance Computing for the Precision Health Initiative

Presentation transcript:

Theme 4: High-Performance Computing for the Precision Health Initiative
PHI Internal Presentation, Indianapolis, June 15, 2017
Geoffrey Fox, gcf@indiana.edu
http://www.dsc.soic.indiana.edu/, http://spidal.org/, http://hpc-abds.org/kaleidoscope/
Digital Science Center, Department of Intelligent Systems Engineering
School of Informatics and Computing, Indiana University Bloomington

Intelligent Systems Engineering
13 faculty, expecting to add 5 this year.
High Performance and Computer Engineering; Bioengineering.
Computational approaches: cancer and virtual tissues (Glazier, Macklin).
Maria Bondesson: toxicity experiments on zebrafish.
Feng Guo: 2016- postdoc at Stanford; 2015 Ph.D. in Engineering Science and Mechanics, PSU; B.S. in Physics, Wuhan, 2007. Acoustic tweezers, biomedical devices, and instruments for a wide variety of applications ranging from single-cell analysis and prenatal diagnostics to virology and neuroscience.
Gary Lewis: Ph.D. in Bioengineering, University of Illinois Chicago (currently faculty at UNC); wearable data analysis with the Kinsey Institute.
Other areas: Robotics/Internet of Things, Nanoengineering, Intelligent Systems, Neuroengineering, Environmental Engineering.

Digital Science Center
Faculty (Fox, Crandall, Qiu); postdocs/research staff; students (Ph.D. and undergraduate); systems, software engineering, and business staff.
Runs computer infrastructure for cloud and HPC research:
- 64-node system Tango with high-performance disks (SSD; NVRAM = 5x SSD and 25x HDD) and Intel KNL (Knights Landing) manycore (68-72 core) chips; Omni-Path interconnect.
- 128-node system Juliet with two 12-18 core Haswell chips per node, SSD and conventional HDD disks; Infiniband interconnect.
- Deep learning system Romeo with 16 GPUs and 4 Haswell nodes.
- 200 older nodes.
- All can run HDFS and store data on the nodes.
Big Data engineering (data science) research with applied collaborators; system architecture and performance.
Teaches basic and advanced cloud computing and big data courses.
Works with NIST on Big Data standards and non-proprietary frameworks.

PHI-Relevant Activities
- Big Data application analysis developed with NIST: 52 use cases (including 1 from Regenstrief); a new survey is starting to gather more data; patterns of execution and classification features.
- IoTCloud: technology to control robots and IoT systems from the cloud.
- Held 2 streaming data analysis workshops with NSF, DoE, and industry.
- Major NSF SPIDAL scalable data science project, including pathology image analysis with SUNY-SB, developing HPC-ABDS (High Performance Computing Enhanced Apache Big Data Stack).
- Work on Heron, Storm, Hadoop, Spark, Flink, and HBase, with communication, APIs, and scheduling addressed.
- Many performance analyses of machine learning applications on different platforms.
- Cloudmesh technology supports software automation, with the same scripts deployable on HPC, OpenStack, Amazon, Azure, Docker, etc.

Components of the Big Data Stack
Google likes to show a timeline; "~" marks the corresponding open-source system:
- 2002 Google File System (GFS) ~ HDFS
- 2004 MapReduce ~ Apache Hadoop
- 2006 BigTable ~ Apache HBase
- 2008 Dremel ~ Apache Drill
- 2009 Pregel ~ Apache Giraph
- 2010 FlumeJava ~ Apache Crunch
- 2010 Colossus: improved GFS
- 2012 Spanner: horizontally scalable NewSQL database ~ CockroachDB
- 2013 F1: horizontally scalable SQL database
- 2013 MillWheel ~ Apache Storm, Twitter Heron (Google not first!)
- 2015 Cloud Dataflow ~ Apache Beam with Spark or Flink (dataflow) engine
Functionalities not identified: security, data transfer, scheduling, serverless computing.
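To make the MapReduce pattern in this timeline concrete, here is a minimal single-process sketch of the map, shuffle, and reduce stages for word counting. The function names are illustrative, not Hadoop's API; a real framework distributes each stage across a cluster.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between stages.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data on clouds", "big data and HPC"]
print(reduce_phase(shuffle(map_phase(docs))))
# {'big': 2, 'data': 2, 'on': 1, 'clouds': 1, 'and': 1, 'hpc': 1}
```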

HPC-ABDS

HPC-ABDS Integrated Software for HPCCloud 3.0 (Big Data ABDS | HPC, Cluster)
- 17. Orchestration: Beam, Crunch, Tez, Cloud Dataflow | Kepler, Pegasus, Taverna
- 16. Libraries: MLlib/Mahout, TensorFlow, CNTK, R, Python | ScaLAPACK, PETSc, Matlab
- 15A. High-Level Programming: Pig, Hive, Drill | Domain-specific languages
- 15B. Platform as a Service: App Engine, BlueMix, Elastic Beanstalk | XSEDE software stack
- Languages: Java, Erlang, Scala, Clojure, SQL, SPARQL, Python | Fortran, C/C++, Python
- 14B. Streaming: Storm, Kafka, Kinesis |
- 13, 14A. Parallel Runtime: Hadoop, MapReduce | MPI/OpenMP/OpenCL
- 2. Coordination: Zookeeper |
- 12. Caching: Memcached |
- 11. Data Management: HBase, Accumulo, Neo4J, MySQL | iRODS
- 10. Data Transfer: Sqoop | GridFTP
- 9. Scheduling: Yarn, Mesos | Slurm
- 8. File Systems: HDFS, object stores | Lustre
- 1, 11A. Formats: Thrift, Protobuf | FITS, HDF
- 5. IaaS: OpenStack, Docker | Linux, bare-metal, SR-IOV
- Infrastructure: CLOUDS | SUPERCOMPUTERS (clouds and/or HPC); CUDA, exascale runtime
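As a small illustration of the layer 13/14A contrast, here is a minimal mpi4py sketch of an MPI allreduce, the collective that HPC runtimes provide natively and that HPC-ABDS work brings to Hadoop/Spark-style systems. It assumes mpi4py and an MPI installation are available.

```python
# Minimal MPI collective example (run with: mpiexec -n 4 python allreduce.py)
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank holds a local partial sum, e.g. computed from its shard of the data.
local = np.array([float(rank + 1)])

# Allreduce combines the partial sums and leaves the result on every rank,
# the communication pattern underlying reductions in iterative MapReduce.
total = np.zeros(1)
comm.Allreduce(local, total, op=MPI.SUM)

print(f"rank {rank}: global sum = {total[0]}")  # 10.0 on 4 ranks
```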

HPCCloud Convergence Architecture
- Run the same HPC-ABDS software across all platforms, but the data management machine has a different balance of I/O, network, and compute from the "model" machine.
- The data storage approach (HDFS vs. object store vs. Lustre-style file systems) is still rather unclear.
- The model behaves similarly whether it comes from Big Data or Big Simulation, so data management plus model covers both Big Data and Big Simulation.
- An HPCCloud capacity-style operational model matches hardware features with application requirements.

Multidimensional Scaling: MDS Results with Flink, Spark, and MPI
- MDS execution time on 16 nodes with 20 processes per node, for a varying number of points.
- MDS execution time with 32,000 points on a varying number of nodes; each node runs 20 parallel tasks.
- MDS performed poorly on Flink due to its lack of support for nested iterations.
- Need data locality (alignment) sensitivity.

Pathology Image Features (cells)
11 images with 20,000 features per image; 96 properties per feature, projected to 3D for visualization and colored by image. Parallel MPI job.
https://spidal-gw.dsc.soic.indiana.edu/public/resultsets/1678860580
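The 96-to-3 dimension reduction behind these plots is multidimensional scaling. Below is a minimal single-node sketch with scikit-learn that illustrates the computation; it is not the parallel SPIDAL/MPI implementation, and the array shape is an assumption for demonstration.

```python
import numpy as np
from sklearn.manifold import MDS

# Toy stand-in for the real data: n feature vectors with 96 properties each.
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 96))

# Metric MDS: find 3-D coordinates whose pairwise distances approximate
# the 96-D pairwise distances (stress minimization via SMACOF).
mds = MDS(n_components=3, n_init=1, max_iter=100, random_state=0)
coords3d = mds.fit_transform(features)

print(coords3d.shape)  # (500, 3): points ready for a 3-D viewer like WebPlotViz
```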

170,000 distinct Fungi sequences
64 clusters, projected to 3D and visualized, colored by cluster. Pipeline of 2 MPI jobs.
https://spidal-gw.dsc.soic.indiana.edu/public/resultsets/1273112137

2D Vector Clustering with Cutoff at 3σ
LCMS mass spectrometer peak clustering with MPI: charge-2 sample with 10.9 million points and 420,000 clusters, visualized in WebPlotViz. Orange stars mark points outside all clusters; yellow circles mark cluster centers. Nature article: "Proteogenomics connects somatic mutations to signalling in breast cancer".
https://spidal-gw.dsc.soic.indiana.edu/public/resultsets/1657429765
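A minimal sketch of the cutoff rule described above: assign each 2-D point to its nearest cluster center unless it lies more than 3σ away, in which case flag it as outside all clusters. The array names and per-cluster σ values are illustrative assumptions, not the SPIDAL implementation.

```python
import numpy as np

def assign_with_cutoff(points, centers, sigma):
    """Assign each 2-D point to its nearest center; label it -1 (outside
    all clusters) if the nearest center is farther than 3*sigma away."""
    # Pairwise distances, shape (n_points, n_clusters).
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    nearest = d.argmin(axis=1)
    within = d[np.arange(len(points)), nearest] <= 3.0 * sigma[nearest]
    return np.where(within, nearest, -1)

# Toy example: two cluster centers with different spreads.
centers = np.array([[0.0, 0.0], [10.0, 0.0]])
sigma = np.array([1.0, 0.5])
points = np.array([[0.5, 0.2], [10.2, 0.1], [5.0, 5.0]])

print(assign_with_cutoff(points, centers, sigma))  # [ 0  1 -1]
```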

Twitter Heron Streaming Software on HPC Hardware
- Intel Haswell cluster with 2.4 GHz processors, 56 Gbps Infiniband, and 1 Gbps Ethernet.
- Intel KNL cluster with 1.4 GHz processors, 100 Gbps Omni-Path, and 1 Gbps Ethernet.
[Figures: performance for large and small messages, with parallelism of 2 on 8 nodes and on 4 nodes.]

Knights Landing (KNL) Data Analytics: Harp, Spark, NOMAD
Single-node and cluster performance on 1.4 GHz 68-core KNL nodes for K-means, SGD, and ALS.
[Figures: strong scaling of multi-node parallelism over the Omni-Path interconnect, and strong scaling of core parallelism within a single node.]
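As background on one of the benchmarked kernels, here is a minimal single-node K-means sketch using scikit-learn. The data shape and parameters are illustrative; the Harp/Spark/NOMAD runs above parallelize the same algorithm across cores and nodes.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-in for a benchmark input: 100k points in 16 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 16))

# Lloyd's algorithm: alternate point-to-centroid assignment and centroid
# update; in the parallel runs the update step becomes a cross-node reduction.
km = KMeans(n_clusters=64, n_init=1, max_iter=50, random_state=0)
labels = km.fit_predict(X)

print(km.inertia_)  # within-cluster sum of squares after convergence
```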

HPCCloud Software-Defined Systems
- Significant advantages in specifying job software with scripts in tools such as Chef, Puppet, and Ansible: "Software Defined Systems" (SDS).
- Chose Ansible because it is Python-based.
- Scripts are less voluminous than machine images; it is easier to ensure the latest version, and easy to recreate an image on demand after crashes.
- In work with NIST, we looked at 87 applications from two of our "big data on cloud" classes and from NIST itself (6).
- The 6 NIST use cases need 27 Ansible roles (distinct software subsystems); the full set of 87 needed 62 separate roles, an average of 4.75 roles per use case.
- With the NIST Public Big Data group, looking at mapping SDS to system architecture.
- Preparing Ansible specifications of many subsystems and use cases.

27 Ansible Roles and Their Re-use in 6 NIST Use Cases
Roles: Hadoop, Mesos, Spark, Storm, Pig, Hive, Drill, HDFS, HBase, MySQL, MongoDB, RethinkDB, Mahout, D3/Tableau, nltk, MLlib, Lucene/Solr, OpenCV, Python, Java, maven, Ganglia, Nagios, spark, supervisord, zookeeper, AlchemyAPI, R.
Use cases: 1. NIST Fingerprint Matching; 2. Human and Face Detection; 3. Twitter Analysis; 4. Analytics for Healthcare Data/Health Informatics; 5. Spatial Big Data/Spatial Statistics/Geographic Information Systems; 6. Data Warehousing and Data Mining.
[Table: a matrix marking which roles each use case requires, with a per-role re-use count; the individual marks did not survive transcription.]
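To make the re-use statistics concrete, here is a small sketch of how the counts (distinct roles, average roles per use case) derive from a use-case-to-roles mapping. The mapping below is a hypothetical fragment, since the matrix marks are missing from the transcript.

```python
# Hypothetical fragment of the use-case -> Ansible-roles mapping; the full
# study covered 87 applications needing 62 distinct roles (4.75 per use case).
use_case_roles = {
    "NIST Fingerprint Matching": ["hadoop", "hdfs", "java", "zookeeper"],
    "Human and Face Detection": ["opencv", "python", "supervisord"],
    "Twitter Analysis": ["storm", "hbase", "nltk", "java", "zookeeper"],
}

# Distinct roles across all use cases: each role is written once, reused many times.
distinct_roles = set().union(*use_case_roles.values())
# Average number of roles a single use case deploys.
avg_roles = sum(len(r) for r in use_case_roles.values()) / len(use_case_roles)

print(f"{len(distinct_roles)} distinct roles; "
      f"{avg_roles:.2f} roles per use case on average")
```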

All Presentations Available

Typical Big Data Pattern 2: Filter Identifying Events
Perform real-time analytics on data source streams and notify users when specified events occur. Technologies: Storm (Heron), Kafka, HBase, Zookeeper.
[Diagram: streamed data is fetched and passed through a filter identifying events; posted data is archived in a repository; identified events are posted for users, who specify the filter and fetch selected events.]
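A minimal sketch of this pattern using the kafka-python client: consume a stream, apply a user-specified filter, and publish identified events to a notification topic. The topic names and predicate are illustrative assumptions; a production deployment would run the filter in Storm/Heron and archive to HBase as the diagram shows.

```python
import json
from kafka import KafkaConsumer, KafkaProducer

# User-specified filter: which records count as "events" (illustrative rule).
def is_event(record):
    return record.get("severity", 0) >= 3

consumer = KafkaConsumer(
    "streaming-data",                      # raw data source stream
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    record = message.value
    # Archive step (the repository / HBase in the slide) would go here.
    if is_event(record):
        # Post identified events so subscribed users are notified.
        producer.send("identified-events", record)
```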

6 Forms of MapReduce
Describes the architecture of:
- Problem (model reflecting data)
- Machine
- Software
Problem Architecture classifiers such as Pleasingly Parallel, Map-Collective, Map-Streaming, and Graph feature in the Processing view.