Big Data Analytics and HPC Platforms

1 Big Data Analytics and HPC Platforms
Lei Huang
Assistant Professor, Computer Science Department
Prairie View A&M University
NPC 2016, Xi’an, China, Oct 29, 2016
Sponsored by NSF

2 The Cloud Computing Research Lab
Goal: build a scalable cloud platform for data computing and analytics
Built on top of Apache Hadoop and Spark
Big data storage, computing, analytics and visualization
Scalable performance
High-level parallel programming languages to facilitate R&D
User-friendly interface
Targeted at the geophysics and image processing domains

3 Domain-specific Data analytics platform based on Spark and Hadoop
High-level languages: Scala, Java and Python
Keep data in distributed memory, and move computation to data
A common data representation (RDD) for parallel transformations
An open and integrated platform with a variety of tools/packages
Balance of performance and productivity

4 The needs of converging HPC with big data
Problems of Spark:
  RDD only allows one-time write
  Inefficient memory usage
  GC takes time
  Overheads of JVM
  Python performance is even worse

5 What Big data needs from HPC
Performance, performance, performance
Data distributions
Communication optimizations
Memory and cache optimizations for data locality
Dynamic scheduling
Compiler and runtime optimizations

6 The Big Data Challenge In exploration of Geophysics
Dataset sizes for a seismic survey 50 km square:
2D (~1975)
  50 lines, spaced 1 km apart, each line 50 km w/ 50 m bins
  20-fold stack, 6 sec at 4 msec sampling
  1 M traces, 6 GB
3D (~2000)
  25 x 25 m bins, 2000 x 2000 lines, 100-fold stack
  400 M traces, 1.2 TB
Advanced 3D (~2015)
  Double bin resolution, double fold, 8 azimuths
  8 sec recording for longer offsets
  25,600 M traces, >100 TB
Time-Lapse (4D): exploration through production
  1 TB/km2 (Eric Green, BP, Mar 2015)
  10,000 M traces (multiple passes), 2,500 TB
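The trace counts on this slide follow from multiplying bins, fold and azimuths, which the short sketch below reproduces. The helper names are hypothetical, "double bin resolution" is read here as twice as many bins in each direction, and the 2D data volume additionally assumes a 6 s record at 4 ms sampling stored as 4-byte samples (the sample format is not stated on the slide).

```python
def trace_count(inline_bins, crossline_bins, fold, azimuths=1):
    """Traces in a survey: one per bin, per fold, per azimuth."""
    return inline_bins * crossline_bins * fold * azimuths


def samples_per_trace(record_seconds, sample_interval_ms):
    """Number of samples in one trace."""
    return round(record_seconds * 1000 / sample_interval_ms)


BYTES_PER_SAMPLE = 4  # assumption: 32-bit samples, not stated in the slides

# 2D: 50 lines, each 50 km long with 50 m bins, 20-fold stack
traces_2d = trace_count(inline_bins=50_000 // 50, crossline_bins=50, fold=20)

# 3D: 2000 x 2000 lines of 25 m bins, 100-fold stack
traces_3d = trace_count(2000, 2000, 100)

# Advanced 3D: twice the bins in each direction, twice the fold, 8 azimuths
traces_adv = trace_count(4000, 4000, 200, azimuths=8)

# 2D volume under the assumptions above (6 s record, 4 ms sampling)
size_2d_bytes = traces_2d * samples_per_trace(6, 4) * BYTES_PER_SAMPLE

print(f"2D: {traces_2d:,} traces, {size_2d_bytes / 1e9:.1f} GB")
print(f"3D: {traces_3d:,} traces")
print(f"Advanced 3D: {traces_adv:,} traces")
```

The factor of 64 between the 3D and advanced 3D trace counts (4 from bin resolution, 2 from fold, 8 from azimuths) is what drives the jump from terabytes to hundreds of terabytes.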

7 Volumetric Data Analytics Platform Software Stack
Data Analytics Applications
Visualization, Workflow, Data Server
Volumetric Data Analytics SDK
OpenCV/Breeze, VolumeRDD, MLlib, DL4J / Caffe
Spark Streaming, Spark Batch, Spark Interactive
Hadoop HDFS, YARN, Mesos, Cassandra

8 Volumetric Data Analytics SDK
loadFromFile (HDFS), aggregate, overlap, transpose, repartition, sample, trace, line, applyMap (usrFunc), save (HDFS), get
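The slides do not say what the overlap operation does. One plausible reading is that each partition of a volume carries a few extra boundary lines (a halo), so that windowed operators can run per partition without communication. The sketch below illustrates that general technique in plain Python; the function names and the moving-average operator are hypothetical and are not the SDK's API.

```python
def partition_with_overlap(n_lines, n_parts, halo):
    """Split line indices 0..n_lines-1 into n_parts contiguous ranges.

    Returns (read_lo, read_hi, own_lo, own_hi) per partition: the partition
    owns [own_lo, own_hi) but reads [read_lo, read_hi), which is the owned
    range extended by `halo` lines on each side and clipped to the volume.
    """
    base, extra = divmod(n_lines, n_parts)
    parts, start = [], 0
    for p in range(n_parts):
        end = start + base + (1 if p < extra else 0)
        parts.append((max(0, start - halo), min(n_lines, end + halo), start, end))
        start = end
    return parts


def smooth(values, radius):
    """Moving average with the window clipped at the ends of `values`."""
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - radius), min(len(values), i + radius + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def smooth_partitioned(values, n_parts, radius):
    """Apply `smooth` independently per overlapped partition, keep owned lines."""
    result = []
    for read_lo, read_hi, own_lo, own_hi in partition_with_overlap(
        len(values), n_parts, radius
    ):
        local = smooth(values[read_lo:read_hi], radius)
        result.extend(local[own_lo - read_lo : own_hi - read_lo])
    return result


data = [float(i * i % 7) for i in range(10)]
print(partition_with_overlap(len(data), 3, 1))
print(smooth_partitioned(data, 3, 1) == smooth(data, 1))
```

Because the halo is as wide as the operator's window radius, every owned line sees exactly the neighbours it would see in a serial run, so the partitioned result matches the serial one.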

9 Parallel Processing Templates

10 Template Code Generator
[Diagram labels: HDFS, CG, User’s Parameter, Template Codes, Spark Application, Spark Jobserver, CG, Kernel Codes, Database]

11 Seismic Data Analytics Process
1. Training: Label data, Feature Vectors, Seismic volume, Seismic Attributes, Evaluation Metrics, Machine Learning Algorithms
2. Prediction: Feature Vectors, Seismic volume, Model, Predicted Results, Seismic Attributes
Seismic Data Analytics SDK
Spark Distributed Processing Engine

12 Machine Learning Combines Multiple Attribute Volumes
Each seismic dataset yields a suite of attributes that are possible faulting indicators
Thinned fault likelihood* from amplitude envelope
Dip-direction curvature from amplitude envelope
* Many thanks to the “fault likelihood” attribute computed by Dave Hale’s IPF package from Colorado School of Mines.

13 Geological Faults Classification
Model  Precision  Recall  F      Accuracy
SVM    33.37      82.76   47.57  95.56
CNN    66.93      88.51   76.22  97.48
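The slide does not define the F column. Assuming it is the F1 score, the harmonic mean of precision and recall, the reported values can be recomputed from the other two columns. A minimal sketch, where `f_score` is a name introduced here for illustration:

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F) as listed on the slide
reported = {
    "SVM": (33.37, 82.76, 47.57),
    "CNN": (66.93, 88.51, 76.22),
}

for model, (p, r, f) in reported.items():
    print(f"{model}: computed F = {f_score(p, r):.2f}, reported F = {f}")
```

Under that assumption the recomputed values agree with the slide to within rounding of the inputs. The SVM's low precision shows why accuracy alone (above 95% for both models) hides the difference between the two classifiers.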

14 Feature Extraction Performance for 100 GB data
Python: 2 days
Spark: out of memory
C++: 3 hours
C++, OpenMP, NFS: 45 minutes
Spark, C++, OpenMP, NFS: 23 minutes
Spark, C++, OpenMP, local disks: 1.0 minute
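To compare these configurations on one scale, the timings can be converted to minutes and expressed as speedups over the Python baseline. The sketch below does only that arithmetic; the pure Spark run is left out because it did not finish, and the dictionary keys simply repeat the slide's labels.

```python
# Run times from the slide, converted to minutes
timings_minutes = {
    "Python": 2 * 24 * 60,
    "C++": 3 * 60,
    "C++, OpenMP, NFS": 45,
    "Spark, C++, OpenMP, NFS": 23,
    "Spark, C++, OpenMP, local disks": 1.0,
}

baseline = timings_minutes["Python"]
speedups = {name: baseline / t for name, t in timings_minutes.items()}

for name, s in speedups.items():
    print(f"{name}: {s:,.1f}x faster than Python")
```

The last two rows use the same software stack and differ only in storage, so the drop from 23 minutes to 1 minute is attributable to reading from local disks instead of NFS.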

15 Visualization for Big Data
Data distributions for both computation and visualization
Built on Open Inventor

16 New Project
NSF EAGER: A Data Flow Approach to Meet the Challenges of Big Data Analytics
Evaluate machine learning algorithms with Fresh Breeze
Potential direction to bring HPC to Big Data

17 Acknowledgement
National Science Foundation

