Big Data Analytics and HPC Platforms
Lei Huang, Assistant Professor
Computer Science Department, Prairie View A&M University
lhuang@pvamu.edu
NPC 2016, Xi’an, China, Oct 29, 2016
Sponsored by NSF
The Cloud Computing Research Lab
Goal: build a scalable cloud platform for data computing and analytics
- Built on top of Apache Hadoop and Spark
- Big data storage, computing, analytics, and visualization
- Scalable performance
- High-level parallel programming languages to facilitate R&D
- User-friendly interface
- Targets the geophysics and image-processing domains
Domain-Specific Data Analytics Platform
- Based on Spark and Hadoop
- High-level languages: Scala, Java, and Python
- Keeps data in distributed memory and moves computation to the data
- A common data representation (RDD) for parallel transformations
- An open, integrated platform with a variety of tools/packages
- Balances performance and productivity
The Need to Converge HPC with Big Data
Limitations of Spark:
- RDDs are write-once (immutable), so in-place updates are not possible
- Inefficient memory usage; garbage collection takes time
- Overheads of the JVM
- Python performance is even worse
What Big Data Needs from HPC
- Performance, performance, performance
- Data distributions
- Communication optimizations
- Memory and cache optimizations for data locality
- Dynamic scheduling
- Compiler and runtime optimizations
The Big Data Challenge in Exploration Geophysics
Dataset sizes for a 50 km square seismic survey:
- 2D (~1975): 50 lines spaced 1 km apart, each line 50 km with 50 m bins; 20-fold stack, 6 sec recording @ 4 msec sampling → 1 M traces, 6 GB
- 3D (~2000): 25 x 25 m bins, 2000 x 2000 lines, 100-fold stack → 400 M traces, 1.2 TB
- Advanced 3D (~2015): double bin resolution, double fold, 8 azimuths; 8 sec recording for longer offsets → 25,600 M traces, >100 TB
- Time-lapse (4D), exploration through production: 1 TB/km2 (Eric Green, BP, Mar 2015) → 10,000 M traces (multiple passes), 2,500 TB
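The 2D survey figures above can be reproduced with simple arithmetic (assuming 4-byte samples, which is consistent with the stated 6 GB total):

```python
# Back-of-the-envelope check of the 2D (~1975) survey numbers.
lines = 50                      # 50 lines, 1 km apart
bins_per_line = 50_000 // 50    # 50 km line / 50 m bins = 1000 CMPs per line
fold = 20                       # 20-fold stack
traces = lines * bins_per_line * fold     # 1,000,000 traces

samples_per_trace = 6_000 // 4            # 6 sec @ 4 msec sampling = 1500 samples
bytes_per_trace = samples_per_trace * 4   # assumed 4-byte samples = 6 KB/trace
total_bytes = traces * bytes_per_trace    # 6,000,000,000 bytes ≈ 6 GB
```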
Volumetric Data Analytics Platform Software Stack
- Applications: data analytics applications, visualization, workflow, data server
- SDK layer: Volumetric Data Analytics SDK, OpenCV/Breeze, VolumeRDD, MLlib, DL4J/Caffe
- Execution: Spark Streaming, Spark Batch, Spark Interactive
- Infrastructure: Hadoop HDFS, YARN, Mesos, Cassandra
Volumetric Data Analytics SDK
Core operations:
- loadFromFile (HDFS) / save (HDFS)
- aggregate, overlap, transpose, repartition
- sample, trace, line
- applyMap (usrFunc), get
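An operation such as `overlap` extends each partition with halo elements from its neighbors, so that stencil-style kernels can run on each partition independently. A pure-Python sketch of the idea on a 1-D array (the function name and signature are illustrative, not the SDK's actual API):

```python
def overlap_partition(data, num_parts, halo):
    """Split a 1-D sequence into num_parts chunks, each extended by
    `halo` elements on both sides (clipped at the array edges)."""
    n = len(data)
    size = n // num_parts
    parts = []
    for i in range(num_parts):
        start = i * size
        stop = n if i == num_parts - 1 else (i + 1) * size
        parts.append(data[max(0, start - halo):min(n, stop + halo)])
    return parts

parts = overlap_partition(list(range(10)), 2, halo=1)
# parts[0] covers indices 0..5 (one halo element on the right),
# parts[1] covers indices 4..9 (one halo element on the left)
```

With the halos in place, a filter of radius `halo` applied per chunk produces the same interior values as a serial pass over the whole array.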
Parallel Processing Templates
Template Code Generator
Components: code generator (CG), user's parameters, template codes, kernel codes database, generated Spark application, Spark Jobserver, HDFS
Seismic Data Analytics Process
1. Training: seismic volume → seismic attributes → feature vectors + labeled data → machine learning algorithms → model, assessed with evaluation metrics
2. Prediction: seismic volume → seismic attributes → feature vectors → model → predicted results
Both phases run on the Seismic Data Analytics SDK over the Spark distributed processing engine.
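A minimal sketch of the two-phase flow in plain Python, with illustrative names (the stand-in attribute computation and the nearest-centroid model are placeholders, not the SDK's actual API or algorithms):

```python
def extract_features(trace):
    # Stand-in "attribute" computation: mean and peak amplitude per trace.
    return (sum(trace) / len(trace), max(trace))

def train(labeled):
    # Phase 1: fit a toy nearest-centroid model on labeled traces
    # (average feature vector per class).
    sums, counts = {}, {}
    for trace, label in labeled:
        f = extract_features(trace)
        s = sums.setdefault(label, [0.0] * len(f))
        for i, v in enumerate(f):
            s[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: tuple(v / counts[c] for v in s) for c, s in sums.items()}

def predict(model, trace):
    # Phase 2: assign the class whose centroid is closest in feature space.
    f = extract_features(trace)
    return min(model, key=lambda c: sum((a - b) ** 2 for a, b in zip(model[c], f)))

model = train([([0, 1, 0], "background"), ([5, 9, 6], "fault")])
label = predict(model, [6, 8, 7])   # → "fault"
```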
Machine Learning Combines Multiple Attribute Volumes
- Each seismic dataset yields a suite of attributes that are possible faulting indicators
- Examples: thinned fault likelihood* from the amplitude envelope; dip-direction curvature from the amplitude envelope
* The "fault likelihood" attribute was computed with Dave Hale's IPF package from the Colorado School of Mines.
Geological Faults Classification

Model | Precision (%) | Recall (%) | F (%) | Accuracy (%)
SVM   | 33.37         | 82.76      | 47.57 | 95.56
CNN   | 66.93         | 88.51      | 76.22 | 97.48
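The F column is the harmonic mean of precision and recall, F = 2PR / (P + R); the two table rows can be checked directly (small differences come from rounding in the reported precision and recall):

```python
def f_score(precision, recall):
    # Harmonic mean of precision and recall (the F1 score).
    return 2 * precision * recall / (precision + recall)

svm_f = f_score(33.37, 82.76)   # ≈ 47.56 (table: 47.57)
cnn_f = f_score(66.93, 88.51)   # ≈ 76.22
```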
Feature Extraction Performance (100 GB dataset)
- Python: 2 days
- Spark: out of memory
- C++: 3 hours
- C++ + OpenMP + NFS: 45 minutes
- Spark + C++ + OpenMP + NFS: 23 minutes
- Spark + C++ + OpenMP + local disks: 1.0 minute
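Relative to the 2-day Python baseline, these timings imply the following speedups (simple arithmetic on the numbers above):

```python
baseline_min = 2 * 24 * 60    # Python baseline: 2 days = 2880 minutes
timings_min = {
    "C++": 3 * 60,
    "C++ + OpenMP + NFS": 45,
    "Spark + C++ + OpenMP + NFS": 23,
    "Spark + C++ + OpenMP + local disks": 1.0,
}
speedups = {name: baseline_min / t for name, t in timings_min.items()}
# e.g. the full stack on local disks is a ~2880x speedup over Python
```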
Visualization for Big Data
- Data distributions for both computation and visualization
- Built on Open Inventor
New Project
- NSF EAGER: A Data Flow Approach to Meet the Challenges of Big Data Analytics
- Evaluate machine learning algorithms with Fresh Breeze
- A potential direction for bringing HPC to big data
Acknowledgement
National Science Foundation