Big Data Analytics and HPC Platforms

Big Data Analytics and HPC Platforms
Lei Huang Assistant Professor Computer Science Department Prairie View A&M University NPC 2016 Xi’an, China Oct 29, 2016 Sponsored by NSF

The Cloud Computing Research Lab
Goal: build a scalable cloud platform for data computing and analytics Built on top of Apache Hadoop and Spark Big data storage, computing, analytics and visualization Scalable performance High-level parallel programming languages to facilitate R&D User-friendly interface Target to the geophysics and image processing domains

Domain-specific Data analytics platform based on Spark and Hadoop
High level languages: Scala, Java and Python Keep data in distributed memory, and move computation to data A common data representation (RDD) for parallel transformations An open and integrated platform with a variety of tools/packages Balance of performance and productivity

The needs of converging HPC with big data
Problems of Spark RDD only allows one-time write Inefficient memory usage GC takes time Overheads of JVM Python performance is even worse

What Big data needs from HPC
Performance, performance, performance Data Distributions Communication optimizations Memory and cache optimizations for data locality Dynamic scheduling Compiler and runtime optimizations

The Big Data Challenge In exploration of Geophysics
Dataset sizes for a seismic survey 50 km square: 2D ~1975 50 lines, spaced 1 km apart, each line 50 km w/ 50 m bins 20-fold stack, 6 4 msec sampling 1 M traces, 6 GB 3D ~2000 25 x 25 m bins, 2000 x 2000 lines, 100-fold stack 400 M traces, 1.2 TB Advanced 3D ~2015 Double bin resolution, double fold, 8 azimuths 8 sec recording for longer offsets 25,600 M traces, >100 TB Time-Lapse (4D) exploration through production 1 TB /km2 – Eric Green, BP (Mar 2015) 10,000 M traces (multiple passes), 2,500 TB

Volumetric Data Analytics Platform Software Stack
Data Analytics Applications Visualization Workflow Data Server Volumetric Data Analytics SDK OpenCV/Breeze VolumeRDD MLlib DL4J / Caffe Spark Streaming Spark Batch Spark Interactive Hadoop HDFS YARN Mesos Cassandra

Volumetric Data Analytics SDK
loadFromFile (HDFS) aggregate overlap transpose repartition sample trace line applyMap (usrFunc) save (HDFS) get

Parallel Processing Templates

Template Code Generator
HDFS CG User’s Parameter Template Codes Spark Application Spark Jobserver CG Kernel Codes Database

Seismic Data Analytics Process
1. Training: Label data Feature Vectors Seismic volume Seismic Attributes Evaluation Metrics Machine Learning Algorithms 2. Prediction: Feature Vectors Seismic volume Model Predicted Results Seismic Attributes Seismic Data Analytics SDK Spark Distributed Processing Engine

Machine Learning COMBINES Multiple Attribute Volumes
Each seismic dataset yields a suite of attributes that are possible faulting indicators Thinned fault likelihood* from amplitude envelope Dip-direction curvature from amplitude envelope * Many thanks to the “fault likelihood” attribute computed by Dave Hale’s IPF package from Colorado School of Mines.

Geological Faults Classification
Model Precision Recall F Accuracy SVM 33.37 82.76 47.57 95.56 CNN 66.93 88.51 76.22 97.48

Feature Extraction Performance for 100GB data Python: 2 days
Spark: out of memory C++: 3 hours C++, OpenMP, NFS: 45 minutes Spark, C++, OpenMP, NFS: 23 minutes Spark, C++, OpenMP, local disks: 1.0 minute

Visualization for BIG dATA
Data Distributions for both computation and visualization Built on Open Inventor

New Project NSF EAGER: A Data Flow Approach to Meet the Challenges of Big Data Analytics Evaluate machine learning algorithms with Fresh Breeze Potential direction to bring HPC to Big Data

acknowledgement National Science Foundation

Big Data Analytics and HPC Platforms

Similar presentations

Presentation on theme: "Big Data Analytics and HPC Platforms"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Big Data Analytics and HPC Platforms

Similar presentations

Presentation on theme: "Big Data Analytics and HPC Platforms"— Presentation transcript:

Similar presentations

About project

Feedback