Big Data Analytics and HPC Platforms
Lei Huang, Assistant Professor
Computer Science Department, Prairie View A&M University
lhuang@pvamu.edu
NPC 2016, Xi’an, China, Oct 29, 2016
Sponsored by NSF
The Cloud Computing Research Lab
Goal: build a scalable cloud platform for data computing and analytics
- Built on top of Apache Hadoop and Spark
- Big data storage, computing, analytics, and visualization
- Scalable performance
- High-level parallel programming languages to facilitate R&D
- User-friendly interface
- Targets the geophysics and image-processing domains
Domain-Specific Data Analytics Platform
- Based on Spark and Hadoop
- High-level languages: Scala, Java, and Python
- Keeps data in distributed memory and moves computation to the data
- A common data representation (RDD) for parallel transformations
- An open, integrated platform with a variety of tools/packages
- Balances performance and productivity
The Need to Converge HPC with Big Data
Limitations of Spark:
- RDDs are write-once (immutable), so in-place updates are not possible
- Inefficient memory usage; garbage collection takes time
- Overheads of the JVM
- Python performance is even worse
What Big Data Needs from HPC
- Performance, performance, performance
- Data distributions
- Communication optimizations
- Memory and cache optimizations for data locality
- Dynamic scheduling
- Compiler and runtime optimizations
The Big Data Challenge in Exploration Geophysics
Dataset sizes for a 50 km square seismic survey:
- 2D (~1975): 50 lines spaced 1 km apart, each line 50 km with 50 m bins; 20-fold stack, 6 sec recording @ 4 msec sampling → 1 M traces, 6 GB
- 3D (~2000): 25 x 25 m bins, 2000 x 2000 lines, 100-fold stack → 400 M traces, 1.2 TB
- Advanced 3D (~2015): double bin resolution, double fold, 8 azimuths; 8 sec recording for longer offsets → 25,600 M traces, >100 TB
- Time-lapse (4D), exploration through production: 1 TB/km2 (Eric Green, BP, Mar 2015) → 10,000 M traces (multiple passes), 2,500 TB
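The 2D survey figures above can be reproduced with simple arithmetic (assuming 4-byte samples, which is consistent with the stated 6 GB total):

```python
# Back-of-the-envelope check of the 2D (~1975) survey numbers.
lines = 50                      # 50 lines, 1 km apart
bins_per_line = 50_000 // 50    # 50 km line / 50 m bins = 1000 CMPs per line
fold = 20                       # 20-fold stack
traces = lines * bins_per_line * fold     # 1,000,000 traces

samples_per_trace = 6_000 // 4            # 6 sec @ 4 msec sampling = 1500 samples
bytes_per_trace = samples_per_trace * 4   # assumed 4-byte samples = 6 KB/trace
total_bytes = traces * bytes_per_trace    # 6,000,000,000 bytes ≈ 6 GB
```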
Volumetric Data Analytics Platform Software Stack
- Applications: data analytics applications, visualization, workflow, data server
- SDK layer: Volumetric Data Analytics SDK, OpenCV/Breeze, VolumeRDD, MLlib, DL4J/Caffe
- Execution: Spark Streaming, Spark Batch, Spark Interactive
- Infrastructure: Hadoop HDFS, YARN, Mesos, Cassandra
Volumetric Data Analytics SDK
Core operations:
- loadFromFile (HDFS) / save (HDFS)
- aggregate, overlap, transpose, repartition
- sample, trace, line
- applyMap (usrFunc), get
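An operation such as `overlap` extends each partition with halo elements from its neighbors, so that stencil-style kernels can run on each partition independently. A pure-Python sketch of the idea on a 1-D array (the function name and signature are illustrative, not the SDK's actual API):

```python
def overlap_partition(data, num_parts, halo):
    """Split a 1-D sequence into num_parts chunks, each extended by
    `halo` elements on both sides (clipped at the array edges)."""
    n = len(data)
    size = n // num_parts
    parts = []
    for i in range(num_parts):
        start = i * size
        stop = n if i == num_parts - 1 else (i + 1) * size
        parts.append(data[max(0, start - halo):min(n, stop + halo)])
    return parts

parts = overlap_partition(list(range(10)), 2, halo=1)
# parts[0] covers indices 0..5 (one halo element on the right),
# parts[1] covers indices 4..9 (one halo element on the left)
```

With the halos in place, a filter of radius `halo` applied per chunk produces the same interior values as a serial pass over the whole array.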
Parallel Processing Templates
Template Code Generator
Components: code generator (CG), user's parameters, template codes, kernel codes database, generated Spark application, Spark Jobserver, HDFS
Seismic Data Analytics Process
1. Training: seismic volume → seismic attributes → feature vectors + labeled data → machine learning algorithms → model, assessed with evaluation metrics
2. Prediction: seismic volume → seismic attributes → feature vectors → model → predicted results
Both phases run on the Seismic Data Analytics SDK over the Spark distributed processing engine.
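A minimal sketch of the two-phase flow in plain Python, with illustrative names (the stand-in attribute computation and the nearest-centroid model are placeholders, not the SDK's actual API or algorithms):

```python
def extract_features(trace):
    # Stand-in "attribute" computation: mean and peak amplitude per trace.
    return (sum(trace) / len(trace), max(trace))

def train(labeled):
    # Phase 1: fit a toy nearest-centroid model on labeled traces
    # (average feature vector per class).
    sums, counts = {}, {}
    for trace, label in labeled:
        f = extract_features(trace)
        s = sums.setdefault(label, [0.0] * len(f))
        for i, v in enumerate(f):
            s[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: tuple(v / counts[c] for v in s) for c, s in sums.items()}

def predict(model, trace):
    # Phase 2: assign the class whose centroid is closest in feature space.
    f = extract_features(trace)
    return min(model, key=lambda c: sum((a - b) ** 2 for a, b in zip(model[c], f)))

model = train([([0, 1, 0], "background"), ([5, 9, 6], "fault")])
label = predict(model, [6, 8, 7])   # → "fault"
```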
Machine Learning Combines Multiple Attribute Volumes
- Each seismic dataset yields a suite of attributes that are possible faulting indicators
- Examples: thinned fault likelihood* from the amplitude envelope; dip-direction curvature from the amplitude envelope
* The "fault likelihood" attribute was computed with Dave Hale's IPF package from the Colorado School of Mines.
Geological Faults Classification

Model | Precision (%) | Recall (%) | F (%) | Accuracy (%)
SVM   | 33.37         | 82.76      | 47.57 | 95.56
CNN   | 66.93         | 88.51      | 76.22 | 97.48
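The F column is the harmonic mean of precision and recall, F = 2PR / (P + R); the two table rows can be checked directly (small differences come from rounding in the reported precision and recall):

```python
def f_score(precision, recall):
    # Harmonic mean of precision and recall (the F1 score).
    return 2 * precision * recall / (precision + recall)

svm_f = f_score(33.37, 82.76)   # ≈ 47.56 (table: 47.57)
cnn_f = f_score(66.93, 88.51)   # ≈ 76.22
```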
Feature Extraction Performance (100 GB dataset)
- Python: 2 days
- Spark: out of memory
- C++: 3 hours
- C++ + OpenMP + NFS: 45 minutes
- Spark + C++ + OpenMP + NFS: 23 minutes
- Spark + C++ + OpenMP + local disks: 1.0 minute
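Relative to the 2-day Python baseline, these timings imply the following speedups (simple arithmetic on the numbers above):

```python
baseline_min = 2 * 24 * 60    # Python baseline: 2 days = 2880 minutes
timings_min = {
    "C++": 3 * 60,
    "C++ + OpenMP + NFS": 45,
    "Spark + C++ + OpenMP + NFS": 23,
    "Spark + C++ + OpenMP + local disks": 1.0,
}
speedups = {name: baseline_min / t for name, t in timings_min.items()}
# e.g. the full stack on local disks is a ~2880x speedup over Python
```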
Visualization for Big Data
- Data distributions for both computation and visualization
- Built on Open Inventor
New Project
- NSF EAGER: A Data Flow Approach to Meet the Challenges of Big Data Analytics
- Evaluate machine learning algorithms with Fresh Breeze
- A potential direction for bringing HPC to big data
Acknowledgement
National Science Foundation