Big Data Analytics and HPC Platforms


Big Data Analytics and HPC Platforms
Lei Huang, Assistant Professor, Computer Science Department, Prairie View A&M University (lhuang@pvamu.edu)
NPC 2016, Xi’an, China, Oct 29, 2016
Sponsored by NSF

The Cloud Computing Research Lab
- Goal: build a scalable cloud platform for data computing and analytics
- Built on top of Apache Hadoop and Spark
- Big data storage, computing, analytics and visualization
- Scalable performance
- High-level parallel programming languages to facilitate R&D
- User-friendly interface
- Targeted at the geophysics and image processing domains

Domain-specific data analytics platform based on Spark and Hadoop
- High-level languages: Scala, Java and Python
- Keep data in distributed memory, and move computation to data
- A common data representation (RDD) for parallel transformations
- An open and integrated platform with a variety of tools/packages
- Balance of performance and productivity
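The idea of keeping data in partitions and moving computation to the data can be illustrated with a toy, single-machine sketch. The class `ToyPartitionedData` and its methods are hypothetical names invented for this illustration; they are not Spark's RDD API, and threads stand in for cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


class ToyPartitionedData:
    """Hypothetical stand-in for an RDD: a list of in-memory partitions."""

    def __init__(self, partitions):
        self.partitions = [list(p) for p in partitions]

    def map_partitions(self, func):
        # Send the function to every partition; the data is not gathered first.
        with ThreadPoolExecutor() as pool:
            return ToyPartitionedData(pool.map(func, self.partitions))

    def reduce(self, op):
        # Reduce inside each partition, then combine the small partial results.
        partials = [reduce(op, p) for p in self.partitions if p]
        return reduce(op, partials)


data = ToyPartitionedData([[1, 2, 3], [4, 5], [6, 7, 8, 9]])
squared = data.map_partitions(lambda part: [x * x for x in part])
total = squared.reduce(lambda a, b: a + b)
print(total)
```

Only the per-partition partial sums travel to the final combine step, which is the point of moving computation to the data rather than the reverse.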

The need for converging HPC with big data
Problems of Spark:
- An RDD only allows a one-time write
- Inefficient memory usage
- GC takes time
- Overheads of the JVM
- Python performance is even worse

What big data needs from HPC
- Performance, performance, performance
- Data distributions
- Communication optimizations
- Memory and cache optimizations for data locality
- Dynamic scheduling
- Compiler and runtime optimizations

The Big Data Challenge in Geophysical Exploration
Dataset sizes for a seismic survey, 50 km square:
- 2D (~1975): 50 lines spaced 1 km apart, each line 50 km with 50 m bins; 20-fold stack; 6 s at 4 ms sampling; 1 M traces, 6 GB
- 3D (~2000): 25 x 25 m bins, 2000 x 2000 lines; 100-fold stack; 400 M traces, 1.2 TB
- Advanced 3D (~2015): double bin resolution, double fold, 8 azimuths; 8 s recording for longer offsets; 25,600 M traces, >100 TB
- Time-lapse (4D), exploration through production: 1 TB/km² (Eric Green, BP, Mar 2015); 10,000 M traces (multiple passes), 2,500 TB
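The trace counts quoted for the survey can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: 4 bytes per sample is assumed (the slide does not give a sample format), and "double bin resolution" is read as halving the bin size in both directions, which quadruples the number of bins. The function names are illustrative only.

```python
BYTES_PER_SAMPLE = 4  # assumption: 32-bit samples; not stated on the slide


def trace_count(bins, fold, azimuths=1):
    """Number of traces = surface bins x fold x azimuths."""
    return bins * fold * azimuths


def samples_per_trace(record_seconds, sample_interval_ms):
    return round(record_seconds * 1000 / sample_interval_ms)


# 2D: 50 lines, each 50 km long with 50 m bins, 20-fold, 6 s at 4 ms
bins_2d = 50 * (50_000 // 50)
traces_2d = trace_count(bins_2d, fold=20)
size_2d_gb = traces_2d * samples_per_trace(6, 4) * BYTES_PER_SAMPLE / 1e9

# 3D: 2000 x 2000 lines of 25 m bins, 100-fold
traces_3d = trace_count(2000 * 2000, fold=100)

# Advanced 3D: 4000 x 4000 bins (assumed reading), 200-fold, 8 azimuths, 8 s
traces_adv = trace_count(4000 * 4000, fold=200, azimuths=8)
size_adv_tb = traces_adv * samples_per_trace(8, 4) * BYTES_PER_SAMPLE / 1e12

print(traces_2d, size_2d_gb)    # 2D trace count and size in GB
print(traces_3d)                # 3D trace count
print(traces_adv, size_adv_tb)  # advanced 3D trace count and size in TB
```

Under these assumptions the 2D and advanced 3D figures agree with the slide (1 M traces and 6 GB; 25,600 M traces and more than 100 TB), and the 3D trace count matches 400 M.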

Volumetric Data Analytics Platform Software Stack
Data Analytics Applications, Visualization, Workflow, Data Server, Volumetric Data Analytics SDK, OpenCV/Breeze, VolumeRDD, MLlib, DL4J / Caffe, Spark Streaming, Spark Batch, Spark Interactive, Hadoop HDFS, YARN, Mesos, Cassandra

Volumetric Data Analytics SDK
loadFromFile (HDFS), aggregate, overlap, transpose, repartition, sample, trace, line, applyMap (usrFunc), save (HDFS), get
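The slide does not define these operations, so the following is only a guess at why an overlap step would sit next to applyMap: a neighborhood filter applied per partition needs a few extra elements from adjacent partitions to give the same answer as the unpartitioned filter. All names below (`split_with_overlap`, `apply_map`, `smooth`) are hypothetical and are not the SDK's API; each trace is reduced to a single number to keep the sketch short.

```python
def split_with_overlap(traces, num_parts, halo):
    """Split into chunks that carry `halo` extra elements from each neighbor."""
    n = len(traces)
    size = -(-n // num_parts)  # ceiling division
    chunks = []
    for start in range(0, n, size):
        lo = max(0, start - halo)
        hi = min(n, start + size + halo)
        # Remember where the chunk's own elements sit inside the padded slice.
        chunks.append((start - lo, min(size, n - start), traces[lo:hi]))
    return chunks


def apply_map(chunks, usr_func):
    """Run the user function per chunk and keep only each chunk's own results."""
    out = []
    for offset, count, chunk in chunks:
        result = usr_func(chunk)
        out.extend(result[offset:offset + count])
    return out


def smooth(values):
    """3-point moving average; windows are truncated at the edges."""
    result = []
    for i in range(len(values)):
        window = values[max(0, i - 1):i + 2]
        result.append(sum(window) / len(window))
    return result


values = [float(v) for v in range(10)]
chunks = split_with_overlap(values, num_parts=3, halo=1)
partitioned = apply_map(chunks, smooth)
print(partitioned == smooth(values))
```

With a halo of one element, the partitioned result equals the result of smoothing the whole list at once; with a halo of zero the values at chunk boundaries would differ.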

Parallel Processing Templates

Template Code Generator
[Diagram labels: HDFS, CG, User’s Parameter, Template Codes, Spark Application, Spark Jobserver, Kernel Codes, Database]

Seismic Data Analytics Process
1. Training: label data, feature vectors, seismic volume, seismic attributes, evaluation metrics, machine learning algorithms
2. Prediction: feature vectors, seismic volume, model, predicted results, seismic attributes
Seismic Data Analytics SDK, Spark Distributed Processing Engine

Machine Learning Combines Multiple Attribute Volumes
Each seismic dataset yields a suite of attributes that are possible faulting indicators:
- Thinned fault likelihood* from amplitude envelope
- Dip-direction curvature from amplitude envelope
* The “fault likelihood” attribute was computed with Dave Hale’s IPF package from Colorado School of Mines; many thanks.

Geological Faults Classification

Model  Precision  Recall  F      Accuracy
SVM    33.37      82.76   47.57  95.56
CNN    66.93      88.51   76.22  97.48
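The F column is consistent with the usual F1 score, the harmonic mean of precision and recall. The slide does not state which F-measure was used, so treat this as a check under that assumption; the SVM value differs from the table by 0.01, which is plausibly rounding of the inputs.

```python
def f_score(precision, recall):
    """F1: harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) as reported in the table
results = {"SVM": (33.37, 82.76), "CNN": (66.93, 88.51)}
f_scores = {model: f_score(p, r) for model, (p, r) in results.items()}

for model, value in f_scores.items():
    print(f"{model}: F = {value:.2f}")
```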

Feature Extraction Performance for 100 GB of Data
- Python: 2 days
- Spark: out of memory
- C++: 3 hours
- C++, OpenMP, NFS: 45 minutes
- Spark, C++, OpenMP, NFS: 23 minutes
- Spark, C++, OpenMP, local disks: 1.0 minute
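To compare these timings on one scale, the snippet below converts each reported runtime to minutes and computes the speedup over the Python baseline. The pure Spark run is left out because it ran out of memory and has no timing; the variable names are illustrative.

```python
# Reported runtimes converted to minutes
runtimes_minutes = {
    "Python": 2 * 24 * 60,
    "C++": 3 * 60,
    "C++, OpenMP, NFS": 45,
    "Spark, C++, OpenMP, NFS": 23,
    "Spark, C++, OpenMP, local disks": 1.0,
}

baseline = runtimes_minutes["Python"]
speedups = {name: baseline / minutes for name, minutes in runtimes_minutes.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.0f}x faster than Python")
```

The step from NFS to local disks (23 minutes to 1.0 minute) is a factor of 23 on its own.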

Visualization for Big Data
- Data distributions for both computation and visualization
- Built on Open Inventor

New Project
NSF EAGER: A Data Flow Approach to Meet the Challenges of Big Data Analytics
- Evaluate machine learning algorithms with Fresh Breeze
- Potential direction to bring HPC to Big Data

Acknowledgement: National Science Foundation