1 Scalable High Performance Data Analytics
Intel® PCC Showcase: Indiana University, May 22, 2017. Judy Qiu, Intelligent Systems Engineering Department, Indiana University.

2 Outline
1. Introduction: HPC-ABDS, Harp (Hadoop plug-in), DAAL
2. Optimization Methodologies
3. Results (configuration, benchmark)
4. Code Optimization Highlights
5. Conclusions and Future Work

3

4 High Performance – Apache Big Data Stack
Big Data Application Analysis [1][2] identifies features of data-intensive applications that need to be supported in software and represented in benchmarks. This analysis has been extended to support HPC-Simulations-Big Data convergence. The project is a collaboration between computer and domain scientists in application areas in Biomolecular Simulations, Network Science, Epidemiology, Computer Vision, Spatial Geographical Information Systems, Remote Sensing for Polar Science, and Pathology Informatics. HPC-ABDS [3] was developed as a bold idea: Cloud-HPC interoperable software with the performance of HPC (High Performance Computing) and the rich functionality of the commodity Apache Big Data Stack.
[1] Shantenu Jha, Judy Qiu, Andre Luckow, Pradeep Mantha, Geoffrey Fox, A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures, Proceedings of the 3rd International Congress on Big Data Conference (IEEE BigData 2014).
[2] Geoffrey Fox, Shantenu Jha, Judy Qiu, Andre Luckow, Towards an Understanding of Facets and Exemplars of Big Data Applications, accepted to the Twenty Years of Beowulf Workshop (Beowulf), 2014.
[3] Judy Qiu, Shantenu Jha, Andre Luckow, Geoffrey Fox, Towards HPC-ABDS: An Initial High-Performance Big Data Stack, accepted to the proceedings of the ACM 1st Big Data Interoperability Framework Workshop: Building Robust Big Data Ecosystem, NIST special publication, 2014.

5 Scalable Parallel Interoperable Data Analytics Library
MIDAS, the integrating middleware that links HPC and ABDS, now has several components, including an architecture for Big Data analytics and an integration of HPC into communication and scheduling on ABDS; it also has rules for getting high-performance Java scientific code.
SPIDAL (Scalable Parallel Interoperable Data Analytics Library) now has 20 members with domain-specific and core algorithms.
Language: SPIDAL Java runs as fast as C++.
Designed and proposed HPCCloud as a hardware-software infrastructure supporting Big Data / Big Simulation convergence: Big Data management via the Apache Stack (ABDS); Big Data analytics using SPIDAL and other libraries.

6 Motivation for faster and bigger problems
Machine learning needs high performance: big data and big models.
Iterative algorithms are fundamental in learning a non-trivial model; model training and hyper-parameter tuning run the iterative algorithms many times.
An architecture for Big Data analytics should understand the algorithms through a model-centric view, focusing on the computation and communication patterns for optimization.
Trade-offs of efficiency and productivity: linear speedup with an increasing number of processors; easier to parallelize on multicore or manycore computers.

7 Motivation of Iterative MapReduce
Classic parallel runtimes (MPI): data centered, QoS, efficient and proven techniques. Goal: expand the applicability of MapReduce to more classes of applications.
Spectrum of patterns: Map-Only (input → map → output); MapReduce (input → map → reduce → output); Iterative MapReduce (input → map → reduce, with iterations); MPI with point-to-point communication (Pij).

8 Programming Model for Iterative MapReduce
Iterative MapReduce is a programming model that applies a computation (e.g., a Map task) or function repeatedly, using the output of one iteration as the input of the next. With this pattern, complex computations can be solved with apparently simple (user-defined) functions. Key elements:
Loop-invariant data is loaded only once at Configure(); there is a distinction between loop-invariant (e.g., input) data and variable (e.g., intermediate) data.
Cacheable map/reduce tasks (in memory).
Faster intermediate data transfer mechanism.
Combine operation, Combine(Map<Key,Value>), to collect all reduce outputs.
Driver pattern: the main program runs while(..) { runMapReduce(..) } over Map(Key, Value) and Reduce(Key, List<Value>) tasks; a sketch follows below.
Twister was our initial implementation, with its first paper having 854 Google Scholar citations.
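To make the driver pattern concrete, here is a minimal, self-contained Java sketch of an iterative MapReduce loop using PageRank as the example. The class and method names (IterativePageRank, runMapReduce) are illustrative only and are not the Twister API.

```java
// A minimal sketch of the iterative MapReduce pattern: loop-invariant graph data
// is cached once, while the rank vector (variable data) is refined each round.
import java.util.*;

public class IterativePageRank {
    // Loop-invariant data (the graph), loaded once and cached across iterations.
    static final Map<Integer, int[]> LINKS = Map.of(
        0, new int[] {1, 2}, 1, new int[] {2}, 2, new int[] {0}, 3, new int[] {2});

    public static void main(String[] args) {
        // Variable (intermediate) data: the rank vector.
        Map<Integer, Double> ranks = new HashMap<>();
        LINKS.keySet().forEach(p -> ranks.put(p, 1.0 / LINKS.size()));
        for (int iter = 0; iter < 20; iter++) {            // while(..) { runMapReduce(..) }
            Map<Integer, Double> next = runMapReduce(ranks);
            ranks.clear();
            ranks.putAll(next);
        }
        System.out.println(ranks);
    }

    static Map<Integer, Double> runMapReduce(Map<Integer, Double> ranks) {
        // Map(Key, Value): each page emits (target, contribution) pairs.
        List<Map.Entry<Integer, Double>> emitted = new ArrayList<>();
        ranks.forEach((page, rank) -> {
            for (int target : LINKS.get(page))
                emitted.add(Map.entry(target, rank / LINKS.get(page).length));
        });
        // Reduce(Key, List<Value>): sum contributions per page, apply damping.
        Map<Integer, Double> reduced = new HashMap<>();
        LINKS.keySet().forEach(p -> reduced.put(p, 0.15 / LINKS.size()));
        for (Map.Entry<Integer, Double> e : emitted)
            reduced.merge(e.getKey(), 0.85 * e.getValue(), Double::sum);
        // Combine(Map<Key,Value>): the collected reduce outputs become the next input.
        return reduced;
    }
}
```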

9 MapReduce Optimized for Iterative Computations
The speedy elephant. Abstractions:
In-Memory [3]: cacheable map/reduce tasks
Data Flow [3]: iterative; loop-invariant vs. variable data
Thread [3]: lightweight; local aggregation
Map-Collective [5]: communication patterns optimized for large intermediate data transfer
Portability [4][5]: HPC (Java), Azure Cloud (C#), Supercomputer (C++, Java)
Three PhD theses/publications:
[3] J. Ekanayake et al., "Twister: A Runtime for Iterative MapReduce," in Proceedings of the 1st International Workshop on MapReduce and its Applications of the ACM HPDC 2010 conference.
[4] T. Gunarathne et al., "Portable Parallel Programming on Cloud and HPC: Scientific Applications of Twister4Azure," in Proceedings of the 4th IEEE International Conference on Utility and Cloud Computing (UCC 2011).
[5] B. Zhang et al., "Harp: Collective Communication on Hadoop," in Proceedings of the IEEE International Conference on Cloud Engineering (IC2E 2015).

10 The Concept of Harp Plug-in
Parallelism model and architecture: the MapCollective model extends the MapReduce model, replacing the shuffle between M and R tasks with collective communication among mappers. In the architecture, Harp sits as a plug-in on top of MapReduce V2 (the application framework) over YARN (the resource manager), so MapCollective applications run alongside MapReduce applications.
Harp is an open-source project developed at Indiana University [6]. It has:
MPI-like collective communication operations that are highly optimized for big data problems
Efficient and innovative computation models for different machine learning problems
[6] Harp project. Available at

11 Harp-DAAL Framework for Multi-core and Many-core architectures
Bridges the gap between HPC hardware and big data / machine learning software.
Supports iterative computation, collective communication, Intel DAAL and native kernels.
Portable to new many-core architectures like Xeon Phi; runs on Haswell and KNL clusters.

12 Harp-DAAL: Machine Learning Algorithms
within Hadoop clusters on multi-/many-core hardware.
Bridges the gap between HPC hardware and big data / machine learning software: iterative MapReduce + collective communication + Intel DAAL HPC native kernels.
Good support for new many-core architectures like Intel Xeon Phi (KNL).

13 Outline
1. Introduction: HPC-ABDS, Harp (Hadoop plug-in), DAAL
2. Optimization Methodologies
3. Results (configuration, benchmark)
4. Code Optimization Highlights
5. Conclusions and Future Work

14 General Reduction in Hadoop, Spark, Flink
Comparison of reductions: Hadoop, Spark, and Flink use separate Map and Reduce tasks, with reduce output partitioned by key and followed by a broadcast for AllReduce, a common approach to support iterative algorithms. Switching tasks is expensive.
MPI has only one set of tasks for map and reduce; it achieves AllReduce by interleaving multiple binary trees and gets efficiency by using shared memory intra-node (e.g., multi-/manycore, GPU).
For example, paper [7] shows that 10 learning algorithms can be written in a certain "summation form," which allows them to be easily parallelized on multicore computers (see the formula below).
[7] Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary Bradski, Andrew Y. Ng and Kunle Olukotun, Map-Reduce for Machine Learning on Multicore, in NIPS.
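For reference, the "summation form" of [7] can be restated as follows (a standard restatement, not quoted verbatim from the paper): the model update needs only a global sum of per-example statistics, which decomposes into per-partition partial sums that a reduce or allreduce can combine.

```latex
% Summation form: the update of model \theta needs only a global sum of
% per-example statistics, which partitions across p workers.
\[
  S(\theta) \;=\; \sum_{i=1}^{n} f(x_i;\theta)
            \;=\; \sum_{j=1}^{p} \underbrace{\sum_{i \in D_j} f(x_i;\theta)}_{\text{local partial sum on worker } j},
  \qquad
  \theta \leftarrow g\bigl(S(\theta)\bigr).
\]
```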

15 HPC Runtime versus ABDS distributed Computing Model on Data Analytics
Hadoop writes to disk and is slowest; Spark and Flink spawn many processes and do not support allreduce directly; MPI does in-place combined reduce/broadcast

16 Illustration of In-Place AllReduce in MPI

17 Why Collective Communications for Big Data Processing?
Collective communication and data abstractions: optimization of global model synchronization; ML algorithms trade convergence vs. consistency, and model updates can be out of order; hierarchical data abstractions and operations.
Map-Collective programming model: extended from the MapReduce model to support collective communications; BSP parallelism at inter-node vs. intra-node levels.
Harp implementation: a plug-in to Hadoop.

18 Harp APIs
Schedulers: DynamicScheduler, StaticScheduler
Collective: broadcast, reduce, allgather, allreduce, regroup, push, pull, rotate
Event-driven: getEvent, waitEvent, sendEvent
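The following self-contained Java snippet illustrates the semantics of three of these collectives on per-worker data. It is a teaching sketch of what the operations do, not the Harp implementation or its API.

```java
// Semantics-only illustration of allreduce, allgather, and rotate over
// per-worker partitions (plain Java collections, not Harp classes).
import java.util.*;

public class CollectiveSemanticsDemo {
    public static void main(String[] args) {
        // Three workers, each owning one partition of a "model".
        List<List<Integer>> workers = new ArrayList<>(List.of(
            new ArrayList<>(List.of(1, 2)),
            new ArrayList<>(List.of(3, 4)),
            new ArrayList<>(List.of(5, 6))));

        // allreduce: combine all partitions (here: elementwise sum) and give
        // every worker the same combined result.
        int[] sum = new int[2];
        for (List<Integer> w : workers)
            for (int i = 0; i < 2; i++) sum[i] += w.get(i);
        System.out.println("allreduce -> every worker holds " + Arrays.toString(sum));

        // allgather: every worker ends up holding all partitions concatenated.
        List<Integer> gathered = new ArrayList<>();
        workers.forEach(gathered::addAll);
        System.out.println("allgather -> every worker holds " + gathered);

        // rotate: worker i passes its partition to the next worker; after N
        // rotations every worker has seen every partition exactly once.
        Collections.rotate(workers, 1);
        System.out.println("rotate    -> workers now hold " + workers);
    }
}
```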

19 Taxonomy for Machine Learning Algorithms
Optimization and related issues: the task level alone cannot capture the traits of the computation. The model is the key for iterative algorithms; its structure (e.g., vectors, matrices, trees) and size are critical for performance. Each solver has a specific computation and communication pattern. We investigate the different computation and communication patterns of important ML algorithms.

20 Parallel Machine Learning Application Implementation Guidelines
Application: Latent Dirichlet Allocation, Matrix Factorization, Linear Regression, ...
Algorithm: Expectation-Maximization, Gradient Optimization, Markov Chain Monte Carlo, ...
Computation Model: Locking, Rotation, Allreduce, Asynchronous
System Optimization: collective communication operations; load balancing on computation and communication; per-thread implementation

21 Computation Models

22 Harp Solution to Big Data Problems
Model-centric synchronization paradigm.
Distributed memory, communication operations: broadcast, reduce, allgather, allreduce, regroup, push & pull, rotate, sendEvent/getEvent/waitEvent.
Computation models: Locking, Rotation, Asynchronous, Allreduce.
Shared memory: Harp-DAAL high-performance library; schedulers: Dynamic Scheduler, Static Scheduler.

23 Example: K-means Clustering
The Allreduce computation model (model held by the workers). The choice of synchronization operation depends on model size: when the model is small, broadcast & reduce or allreduce; when the model is large but can still be held in each machine's memory, regroup & allgather or push & pull; when the model cannot be held in each machine's memory, rotate. (A sketch of the allreduce variant follows below.)
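Below is a minimal, runnable Java sketch of the allreduce computation model for K-means: each worker computes partial centroid sums and counts over its own data split, an allreduce combines the partials, and every worker applies the same model update. The worker and partition structures are plain arrays here, not Harp classes, and the in-process summation loop stands in for the collective call.

```java
// Allreduce computation model for K-means, simulated in one process.
import java.util.Arrays;

public class KmeansAllreduceSketch {
    public static void main(String[] args) {
        double[][][] splits = {                 // data partitions, one per worker
            { {1, 1}, {1.1, 0.9} },
            { {8, 8}, {7.9, 8.2} },
            { {0.9, 1.2}, {8.1, 7.8} } };
        double[][] centroids = { {0, 0}, {10, 10} };

        for (int iter = 0; iter < 5; iter++) {
            // Each worker produces a local partial model: per-centroid [sumX, sumY, count].
            double[][] global = new double[centroids.length][3];
            for (double[][] split : splits) {
                double[][] local = new double[centroids.length][3];
                for (double[] p : split) {
                    int k = nearest(p, centroids);
                    local[k][0] += p[0]; local[k][1] += p[1]; local[k][2] += 1;
                }
                // "allreduce": elementwise sum of all local partials; in Harp this
                // would be a collective call instead of this in-process loop.
                for (int k = 0; k < global.length; k++)
                    for (int j = 0; j < 3; j++) global[k][j] += local[k][j];
            }
            for (int k = 0; k < centroids.length; k++)
                if (global[k][2] > 0)
                    centroids[k] = new double[] { global[k][0] / global[k][2],
                                                  global[k][1] / global[k][2] };
        }
        System.out.println(Arrays.deepToString(centroids));
    }

    static int nearest(double[] p, double[][] cs) {
        int best = 0; double bestD = Double.MAX_VALUE;
        for (int i = 0; i < cs.length; i++) {
            double d = (p[0]-cs[i][0])*(p[0]-cs[i][0]) + (p[1]-cs[i][1])*(p[1]-cs[i][1]);
            if (d < bestD) { bestD = d; best = i; }
        }
        return best;
    }
}
```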

24 A Parallelization Solution Example: Model Rotation
Training data D resides on HDFS and is loaded, cached, and initialized once. Each worker i (Worker 0, 1, 2) holds a training-data partition D_i and a model slice A_i at iteration t_i. Each iteration: (1) local compute on the held model slice, (2) rotate the model slices among workers, (3) iteration control.
Goals: maximizing the effectiveness of parallel model updates for algorithm convergence; minimizing the overhead of communication for scaling.
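A runnable sketch of the rotation pattern, simulated in one process (this is not Harp's rotate() implementation): each worker updates only the model slice it currently holds using its local data, then the slices rotate; after W shifts every slice has seen every data partition exactly once.

```java
// Model rotation simulated in one process: W workers, W model slices.
import java.util.*;

public class ModelRotationSketch {
    public static void main(String[] args) {
        int W = 3;
        // Local training data D0, D1, D2 (one partition per worker).
        int[][] data = { {0, 1, 2, 3}, {4, 5, 6}, {7, 8, 9, 10, 11} };
        // Model slices A0, A1, A2: slice k counts items with value % W == k.
        int[] model = new int[W];
        // Which slice each worker currently holds (worker i starts with slice i).
        int[] heldSlice = { 0, 1, 2 };

        for (int shift = 0; shift < W; shift++) {
            for (int worker = 0; worker < W; worker++) {
                int k = heldSlice[worker];
                for (int item : data[worker])          // local compute on held slice only
                    if (item % W == k) model[k]++;
            }
            for (int worker = 0; worker < W; worker++) // rotate slices to the next worker
                heldSlice[worker] = (heldSlice[worker] + 1) % W;
        }
        System.out.println("counts per slice: " + Arrays.toString(model));
    }
}
```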

25 Outline
1. Introduction: HPC-ABDS, Harp (Hadoop plug-in), DAAL
2. Optimization Methodologies
3. Results (configuration, benchmark)
4. Code Optimization Highlights
5. Conclusions and Future Work

26 Hardware specifications

27 Harp-LDA Test Plan and Datasets
Datasets:
clueweb: 76,163,963 documents; 999,933 words; 29,911,407,874 tokens
enwiki: 3,775,554 documents; 1,000,000 words; 1,107,903,672 tokens
LDA parameters: K = , α = 0.01, β = 0.01
Test Harp LDA (Latent Dirichlet Allocation) against other state-of-the-art implementations: Petuum Strads LDA and Yahoo! LDA.
Tests: same number of threads on Haswell and KNL; throughput scalability test on Haswell and KNL.

28 LDA and Topic model for Text Mining
Topic modeling is a technique that models data by a probabilistic generative process; Latent Dirichlet Allocation (LDA) is one widely used topic model. In a topic model, documents are mixtures of topics, where a topic is a probability distribution over words (a normalized co-occurrence matrix factored into mixture components and mixture weights); a topic is a semantic unit inside the data. Inference for LDA is an iterative algorithm using shared global model data, e.g., 3.7 million docs, 10k topics, 1 million words.
Blei, D. M., Ng, A. Y. & Jordan, M. I. Latent Dirichlet Allocation. J. Mach. Learn. Res. 3, 993–1022 (2003).
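For reference, the collapsed Gibbs sampling (CGS) update commonly used for LDA inference is shown below in standard notation; this is the textbook formula, not code from Harp-LDA.

```latex
% Collapsed Gibbs sampling update for LDA: probability of assigning topic k
% to token i (word w in document d), given all other assignments z_{-i}.
\[
  p(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\;
  \bigl(n_{d,k}^{-i} + \alpha\bigr)\,
  \frac{n_{k,w}^{-i} + \beta}{n_{k}^{-i} + V\beta}
\]
% n_{d,k}: tokens in document d assigned to topic k; n_{k,w}: assignments of
% word w to topic k; n_k = sum_w n_{k,w}; V: vocabulary size.
```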

29 The Comparison of LDA CGS Model Convergence Speed
The parallelization strategy can strongly affect algorithm convergence and system efficiency. This raises four questions:
What part of the model needs to be synchronized? The parallelization needs to decide which model parts need synchronization.
When should the model synchronization happen? In the parallel execution timeline, the parallelization should choose the time point to perform model synchronization.
Where should the model synchronization occur? The parallelization needs to specify the distribution of the model among parallel components and which components are involved in the synchronization.
How is the model synchronization performed? The parallelization needs to define the abstraction and the mechanism of the model synchronization.
Legend: rtt & Petuum rotate model parameters; lgs & lgs-4s perform one or more rounds of model synchronization per iteration; Yahoo!LDA asynchronously fetches model parameters.

30 Code Optimization Highlights
Choose the proper computation model
Use optimized collective communication
Improve load balancing between parallel nodes
Optimize data structures and loop structures
Improve caching and avoid re-calculation

31 clueweb: 720 threads in total
Haswell: 24 nodes × 30 threads; KNL: 12 nodes × 60 threads. KNL is about four times slower than Haswell when the total number of threads is equal.

32 enwiki: 300 threads in total
Haswell: 10 nodes × 30 threads; KNL: 5 nodes × 60 threads.

33 Scalability Test on Haswell
Configurations (nodes × threads): 2×30, 4×30, 8×30, 16×30.

34 Haswell Throughput Scalability on enwiki
Harp-LDA has the highest throughput, but its speedup is slightly lower than that of the other implementations.

35 Scalability Test on KNL (1)
Configurations (nodes × threads): 2×60, 4×60.

36 Scalability Test on KNL (2)
Configurations (nodes × threads): 8×60, 16×60.

37 KNL Throughput Scalability on enwiki
Harp-LDA has the highest throughput, but its speedup is slightly lower than that of the other implementations.

38 Code Optimization Highlights
Apply a high-performance inter/intra-node parallel computation model: use model rotation both inter-node and intra-node; dynamic control of model synchronization (time-controlled synchronization).
Utilize efficient data structures: use arrays to store topic counts; combine arrays and maps for fast topic search and traversal during the sampling procedure.
Loop optimization for token sampling: loop through words and then documents; sort topic counts so that topics with large counts come first, reducing the time spent looping (see the sketch below).
Cache intermediate results in the sampling process: avoid recalculating topic probabilities during sampling.
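The topic-count sorting optimization can be illustrated with the following self-contained sketch (illustrative only, not the Harp-LDA code): when a topic is sampled by walking a cumulative sum of per-topic weights, visiting high-weight topics first shortens the expected walk.

```java
// Sampling a topic by walking a cumulative sum over per-topic weights.
// Visiting topics in descending-weight order lets the loop exit early when a
// few topics dominate, which is the effect of sorting topic counts.
import java.util.*;

public class SortedTopicSampling {
    public static void main(String[] args) {
        double[] weights = { 0.5, 40.0, 1.5, 120.0, 3.0 };   // unnormalized topic weights
        Integer[] order = { 0, 1, 2, 3, 4 };
        Arrays.sort(order, (a, b) -> Double.compare(weights[b], weights[a])); // big first

        double total = Arrays.stream(weights).sum();
        Random rng = new Random(42);
        int[] hits = new int[weights.length];
        for (int s = 0; s < 100_000; s++) hits[sample(weights, order, total, rng)]++;
        System.out.println(Arrays.toString(hits));           // ~ proportional to weights
    }

    static int sample(double[] w, Integer[] order, double total, Random rng) {
        double u = rng.nextDouble() * total;
        double cumulative = 0.0;
        for (int t : order) {                 // large-weight topics are checked first,
            cumulative += w[t];               // so the loop usually exits early
            if (u < cumulative) return t;
        }
        return order[order.length - 1];       // numerical safety fallback
    }
}
```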

39 Conclusion
Harp LDA is 2x~4x faster than Petuum Strads LDA and more than 10x faster than Yahoo! LDA.
Harp LDA has higher token-sampling throughput than Petuum Strads LDA and Yahoo! LDA at all scales.
The performance gain comes from a collection of system optimizations based on understanding the time-complexity bounds of the different stages of the SparseLDA algorithm.
When the total number of threads is equal, KNL machines are about four times slower than Haswell machines. This implies that, for LDA, when the total number of processors is equal, KNL machines (each using 60 of 68 cores) have similar performance to Haswell machines (each with two processors, each processor using 15 of 18 cores).

40 Harp-DAAL Applications
Harp-DAAL-Kmeans: clustering; vectorized computation; small model data; regular memory access.
Harp-DAAL-SGD: matrix factorization; huge model data; random memory access; easy to scale up, hard to parallelize.
Harp-DAAL-ALS: matrix factorization; huge model data; regular memory access; easy to parallelize, hard to scale up.

41 Computation models for K-means
Harp-DAAL-Kmeans. Inter-node: Allreduce; easy to implement, efficient when the model data is not large. Intra-node: shared memory, matrix-matrix operations; xGemm aggregates vector-vector distance computations into matrix-matrix multiplication for higher computational intensity (BLAS-3), as in the decomposition below.
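The xGemm aggregation relies on the standard expansion of squared Euclidean distances, restated here for reference; the exact kernel layout inside DAAL may differ.

```latex
% Pairwise squared distances between n points X (n x d) and K centroids C (K x d)
% reduce to one matrix-matrix product plus rank-1 corrections:
\[
  \|x_i - c_k\|^2 \;=\; \|x_i\|^2 - 2\, x_i^{\top} c_k + \|c_k\|^2,
  \qquad
  D \;=\; \operatorname{diag}(XX^{\top})\,\mathbf{1}^{\top}
          \;-\; 2\, X C^{\top}
          \;+\; \mathbf{1}\,\operatorname{diag}(CC^{\top})^{\top},
\]
% so the dominant cost is the BLAS-3 call computing X C^T (xGemm).
```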

42 Computation models for MF-SGD
Inter-node: Rotation. Intra-node: Asynchronous.
Rotation: efficient when the model data is large; good scalability.
Asynchronous: random access to model data; good for thread-level workload balance.
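For reference, the per-rating SGD update for matrix factorization in its standard form (generic step size η and regularization λ, not Harp-specific settings):

```latex
% For an observed rating r_{ij}, with user factor u_i and item factor v_j:
\[
  e_{ij} = r_{ij} - u_i^{\top} v_j, \qquad
  u_i \leftarrow u_i + \eta\,\bigl(e_{ij}\, v_j - \lambda\, u_i\bigr), \qquad
  v_j \leftarrow v_j + \eta\,\bigl(e_{ij}\, u_i - \lambda\, v_j\bigr).
\]
% Each update touches only one (u_i, v_j) pair, hence the random access pattern
% and the appeal of rotation/asynchronous execution noted above.
```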

43 Computation Models for ALS
Inter-node: Allreduce. Intra-node: shared memory, matrix operations; xSyrk: symmetric rank-k update.
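For reference, the standard ALS update for an item factor, whose Gram-matrix term is the symmetric rank-k update computed by xSyrk:

```latex
% Fixing the user factors U, the least-squares update for item factor v_j uses
% only the users \Omega_j who rated item j:
\[
  v_j \;=\; \Bigl(\; \underbrace{U_{\Omega_j}^{\top} U_{\Omega_j}}_{\text{xSyrk}}
            \;+\; \lambda I \Bigr)^{-1} U_{\Omega_j}^{\top} r_{\cdot j},
\]
% and symmetrically for the user factors with the item factors fixed.
```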

44 Performance on Single Node
Harp-DAAL-Kmeans vs. Spark-Kmeans: ~20x speedup. Harp-DAAL-Kmeans invokes MKL matrix-operation kernels at the low level; matrix data is stored in contiguous memory space, leading to regular access patterns and data locality.
Harp-DAAL-SGD vs. NOMAD-SGD: comparable performance on small datasets (MovieLens, Netflix); 1.1x to 2.5x on large datasets (Yahoomusic, Enwiki), depending on the data distribution of the matrices.
Harp-DAAL-ALS vs. Spark-ALS: 20x to 50x speedup. Harp-DAAL-ALS invokes MKL at the low level; regular memory access and data locality in matrix operations.
Harp-DAAL has much better single-node performance than the Java solutions (Spark-Kmeans, Spark-ALS) and comparable performance to the state-of-the-art C++ solution (NOMAD-SGD).

45 Performance on Multi-Nodes
Harp-DAAL-Kmeans: 15x to 20x speedup over Spark-Kmeans. Fast single-node performance; near-linear strong scalability from 10 to 20 nodes; after 20 nodes, insufficient computation workload leads to some loss of scalability.
Harp-DAAL-SGD: 2x to 2.5x speedup over NOMAD-SGD. Comparable or faster single-node performance; collective communication operations in Harp-DAAL outperform point-to-point MPI communication in NOMAD.
Harp-DAAL-ALS: 25x to 40x speedup over Spark-ALS. Fast single-node performance; the ALS algorithm is not scalable (high communication ratio).
Harp-DAAL combines the benefits of local computation (DAAL kernels) and communication operations (Harp), which is much better than the Spark solutions and comparable to the MPI solution.

46 Breakdown of Intra-node Performance
Thread scalability: Harp-DAAL's best thread count is 64 (K-means, ALS) and 128 (MF-SGD); beyond 128 threads there is no performance gain, because communication between cores intensifies and cache capacity per thread drops significantly. Spark's best thread count is 256, because Spark cannot fully utilize the AVX-512 VPUs. NOMAD-SGD can use the AVX VPUs, so its best thread count is 64, in the same range as Harp-DAAL-SGD.

47 Breakdown of Intra-node Performance
Spark-Kmeans and Spark-ALS are dominated by computation (retiring), with no AVX-512 to reduce retired instructions; Harp-DAAL improves L1 cache bandwidth utilization thanks to AVX-512.

48 Data Conversion Challenge
Harp data (Table<Obj>, on the JVM heap) must be passed through the DAAL Java API (NumericTable with data on the JVM heap or in native memory; MicroTable with data in native memory) down to the DAAL native kernels.
A single DirectByteBuffer has a size limit of 2 GB.
Data can be stored at the DAAL Java API level in two ways:
Keep data on the JVM heap (no contiguous memory access requirement): small DirectByteBuffers and parallel copy (OpenMP).
Keep data in native memory (contiguous memory access requirement): large DirectByteBuffer and bulk copy.

49 Data Structures of Harp & Intel’s DAAL
Table<Obj> in Harp has a three-level data hierarchy: a table consists of partitions; a partition has a partition id and a data container; the data container wraps Java objects or primitive arrays. Data in different partitions is non-contiguous in memory.
NumericTable in DAAL stores data either in contiguous memory space (native side) or in non-contiguous arrays (Java heap side); DAAL tables support different types of data storage. Data in contiguous memory space favors matrix operations with regular memory accesses.

50 Two Types of Data Conversion
JavaBulkCopy. Dataflow: Harp Table<Obj> → Java primitive array → DirectByteBuffer → NumericTable (DAAL). Pros: simplicity of implementation. Cons: high demand on DirectByteBuffer size.
NativeDiscreteCopy. Dataflow: Harp Table<Obj> → DAAL Java API (SOANumericTable) → DirectByteBuffer → DAAL native memory. Pros: efficient parallel data copy. Cons: hard to implement at the low-level kernels.
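A self-contained Java sketch of working within the 2 GB DirectByteBuffer limit by copying a large primitive array in chunks; this is illustrative only and is not the Harp-DAAL conversion code.

```java
// Copy a large double[] into several DirectByteBuffers, since a single
// DirectByteBuffer is capped at 2 GB (its capacity is an int number of bytes).
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

public class ChunkedDirectCopy {
    // Chunk size in doubles; each chunk must stay below 2^31 - 1 bytes.
    static final int CHUNK_DOUBLES = 64 * 1024 * 1024;   // 512 MB per buffer

    static List<ByteBuffer> copyToDirectBuffers(double[] data) {
        List<ByteBuffer> chunks = new ArrayList<>();
        for (int offset = 0; offset < data.length; offset += CHUNK_DOUBLES) {
            int len = Math.min(CHUNK_DOUBLES, data.length - offset);
            ByteBuffer buf = ByteBuffer.allocateDirect(len * Double.BYTES)
                                       .order(ByteOrder.nativeOrder());
            // Bulk copy of one slice of the array into native (off-heap) memory.
            buf.asDoubleBuffer().put(data, offset, len);
            chunks.add(buf);
        }
        return chunks;
    }

    public static void main(String[] args) {
        double[] data = new double[1_000_000];            // small demo array
        for (int i = 0; i < data.length; i++) data[i] = i;
        List<ByteBuffer> chunks = copyToDirectBuffers(data);
        System.out.println("copied into " + chunks.size() + " direct buffer(s)");
    }
}
```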

51 Outline
1. Introduction: HPC-ABDS, Harp (Hadoop plug-in), DAAL
2. Optimization Methodologies
3. Results (configuration, benchmark)
4. Code Optimization Highlights
5. Conclusions and Future Work

52 Six Computation Paradigms for Data Analytics
(1) Map Only (pleasingly parallel): BLAST analysis, local machine learning
(2) Classic Map-Reduce (input → map → reduce → output): High Energy Physics (HEP) histograms, web search, recommender engines
(3) Iterative Map-Reduce or Collective (with iterations): expectation maximization, clustering, linear algebra, PageRank
(4) Point-to-Point or Map-Communication (local, graph): classic MPI, PDE solvers and particle dynamics
(5) Streaming (maps, brokers, events): streaming images from synchrotron sources, telescopes, Internet of Things
(6) Shared memory (Map & Communication): difficult-to-parallelize asynchronous parallel algorithms
These 3 paradigms are our focus.

53 Core Machine Learning Algorithms
1. MB_KMeans: minibatch K-means
2. LR: logistic regression
3. Lasso: linear regression with an l1 regularizer
4. SVM: support vector machine
5. FM: factorization machine
6. CRFs: conditional random fields
7. NN: neural networks
8. ID3/C4.5: decision trees
9. LDA: Latent Dirichlet Allocation
10. KNN: K-nearest neighbors
They have: iterative computation workloads; a high volume of training and model data.

54 Summary of the Candidate Algorithms

55 Scalable Algorithms implemented using Harp
Columns: Category; Applications; Features; Computation Model; Collective Communication.
K-means: Clustering; most scientific domains; vectors; AllReduce (allreduce, regroup+allgather, broadcast+reduce, push+pull) and Rotation (rotate)
Multi-class Logistic Regression: Classification; vectors, words; regroup, rotate, allgather
Random Forests: Classification; allreduce
Support Vector Machine: Classification, Regression
Neural Networks: Classification; image processing, voice recognition
Latent Dirichlet Allocation: Structure learning (latent topic model); text mining, bioinformatics, image processing; sparse vectors, bag of words
Matrix Factorization: Structure learning (matrix completion); recommender systems; irregular sparse matrix, dense model vectors
Multi-Dimensional Scaling: Dimension reduction; visualization and nonlinear identification of principal components; allgather, allreduce
Subgraph Mining: Graph; social network analysis, data mining, fraud detection, chemical informatics, bioinformatics; graph, subgraph
Force-Directed Graph Drawing: Graph; social media community detection and visualization

56 Conclusions
Identification of the Apache Big Data Software Stack and integration with the High Performance Computing stack to give HPC-ABDS: ABDS (many Big Data applications/algorithms need HPC for performance); HPC (needs software model productivity/sustainability).
Identification of 4 computation models for machine learning applications: Locking, Rotation, Allreduce, Asynchronous.
HPC-ABDS plug-in Harp: adds HPC communication performance and rich data abstractions to Hadoop by developing the Harp library of collectives to use at the Reduce phase (Broadcast and Gather are needed by current applications); discover other important ones (e.g., Allgather, global-local sync, rotation pipeline).

57 Hadoop/Harp-DAAL: Prototype and Production Code
Source code is available on Indiana University's GitHub account; an example of MF-SGD is at
Harp-DAAL follows the same standard as DAAL's original code:
improves DAAL's existing algorithms for distributed usage
adds new algorithms to DAAL's codebase
Harp-DAAL's kernel is also compatible with other communication tools.

58 Future Work
High-performance Harp-DAAL data analysis applications
Online clustering with Harp or Storm, integrating parallel and dataflow computing models
Start an HPC incubator project in Apache to bring HPC-ABDS to the community
Integration of Hadoop/Harp with Intel DAAL and other libraries
Implement efficiently on each platform (e.g., Amazon, Azure, Big Red II, Haswell/KNL clusters)

59 Candidates with Distributed Codes and Candidates with Batch Codes
Plan B: a survey and benchmarking effort for machine learning algorithms. We download and run benchmark algorithms from state-of-the-art machine learning libraries and evaluate their performance on different platforms (Xeon, Xeon Phi, and GPU).
Plan A: development of Harp-DAAL applications. DAAL provides batch or distributed C/C++ codes and a Java interface for the following applications:
Candidates with distributed codes: Principal Component Analysis (PCA), Singular Value Decomposition (SVD), Neural Networks
Candidates with batch codes: Cholesky Decomposition, QR Decomposition, Expectation-Maximization, Multivariate Outlier Detection, Univariate Outlier Detection, Association Rules, Support Vector Machine Classifier (SVM)

60 Adding HPC to Storm & Heron for Streaming
Robotics applications: Simultaneous Localization and Mapping (a robot with a laser range finder builds a map from robot data); N-body collision avoidance (robots need to avoid collisions when they move).
Time-series data visualization in real time: map high-dimensional data to a 3D visualizer; applied to stock market data, tracking 6000 stocks.

61 Benchmark communication on Haswell vs. KNL clusters
Plan C: benchmark communication on Haswell vs. KNL clusters, with large and small messages; parallelism of 2 using 8 nodes and parallelism of 2 using 4 nodes.
Intel Haswell cluster with 2.4 GHz processors, 56 Gbps InfiniBand, and 1 Gbps Ethernet; Intel KNL cluster with 1.4 GHz processors, 100 Gbps Omni-Path, and 1 Gbps Ethernet.

62 Acknowledgements
We gratefully acknowledge support from the Intel Parallel Computing Center (IPCC) grant.
Langshi Cheng, Bo Peng, Kannan Govindarajan, Bingjing Zhang, Supun Kamburugamuve, Yiming Zhou, Ethan Li, Mihai Avram, Vibhatha Abeykoon, Yining Wang, Prawal Gangwar, Bingi Raviteja, Anurag Sharma, Manpreet Gulia, Bingi Ravi Teja
SALSA HPC Group, School of Informatics and Computing, Indiana University

