1 Scalable Fast Multipole Methods on Distributed Heterogeneous Architecture
Qi Hu, Nail A. Gumerov, Ramani Duraiswami
Institute for Advanced Computer Studies, Department of Computer Science
University of Maryland, College Park, MD

2 Previous work
FMM on distributed systems
- Greengard and Gropp (1990) discussed parallelizing the FMM
- Ying et al. (2003): a parallel version of the kernel-independent FMM
FMM on GPUs
- Gumerov and Duraiswami (2008) explored the FMM algorithm on the GPU
- Yokota et al. (2009) presented the FMM on a GPU cluster
Other impressive results exploit architecture tuning on networks of multi-core processors or GPUs
- Hamada et al. (2009, 2010): the Gordon Bell Prize at SC'09
- Lashuk et al. (2009) presented a kernel-independent adaptive FMM on heterogeneous architectures
- Chandramowlishwaran et al. (2010): optimizations for multi-core clusters
- Cruz et al. (2010): the PetFMM library

3 Issues with previous results
FMM implementations demonstrated scalability only over a restricted range
- Scalability was shown for less accurate tree codes
Papers did not address the re-computation of neighbor lists at each step
- Important for the dynamic problems we are interested in
Did not use both the CPUs and GPUs that occur together in modern architectures

4 Contributions
Efficient scalable parallel FMM algorithms
- Use both multi-core CPUs and GPUs
- First scalable FMM algorithm on heterogeneous CPU/GPU clusters
- Best timing for a single workstation
Extremely fast parallel algorithms for FMM data structures
- Complexity O(N), much faster than the evaluation steps
- Suitable for dynamic problems
Algorithms achieve 38 TFlops on 32 nodes (64 GPUs)
- Demonstrate strong and weak scalability
- Best scalability per GPU (>600 GFlops/GPU)
- FMM with a billion particles on a midsized cluster

5 Motivation: Brownout
Complicated phenomenon involving interaction between the rotorcraft wake, the ground, and dust particles
Causes accidents due to poor visibility and damage to helicopters
Understanding it can lead to mitigation strategies
Lagrangian (vortex element) methods to compute the flow
Fast evaluation of the fields at particle locations
Need for fast evaluation of all pairwise 3D interactions

6 Motivation
Many other applications require fast evaluation of pairwise interactions with the 3D Laplacian kernel and its derivatives
- Astrophysics (gravity potential and forces) (image: wissrech.ins.uni-bonn.de)
- Molecular dynamics (Coulomb potential and forces)
- Micro- and nanofluidics (complex channel Stokes flows)
- Imaging and graphics (high-quality RBF interpolation)
- Much more!

7 Introduction to fast multipole methods
Problem: compute a matrix-vector product with certain kernels
Linear computation and memory cost, O(N+M), for any prescribed accuracy
Divide the sum into far-field and near-field terms
- Direct kernel evaluations for the near field
- Approximation of the far-field sum via multipole expansions of the kernel function and spatial data structures (an octree in 3D)
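In symbols (the notation below is assumed for illustration, not taken from the slides): for N sources x_i with strengths q_i, M receivers y_j, kernel Phi, and neighborhood Omega(y_j) of the box containing y_j, the sum splits as

$$
\phi(y_j) \;=\; \sum_{i=1}^{N} q_i\,\Phi(y_j, x_i)
\;=\; \underbrace{\sum_{x_i \in \Omega(y_j)} q_i\,\Phi(y_j, x_i)}_{\text{near field: direct sum}}
\;+\; \underbrace{\sum_{x_i \notin \Omega(y_j)} q_i\,\Phi(y_j, x_i)}_{\text{far field: expansions}},
\qquad j = 1,\dots,M.
$$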

8 Introduction to the fast multipole method
The local and multipole expansions of the Laplace kernel about an expansion center, with truncation number p
Expansion regions are valid for well-separated pairs, realized using the spatial boxes of an octree (hierarchical data structure)
Translations of expansion coefficients
- Multipole-to-multipole translations (M|M)
- Multipole-to-local translations (M|L)
- Local-to-local translations (L|L)
r^n Y_n^m: local spherical basis functions; r^-(n+1) Y_n^m: multipole spherical basis functions
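Written out in a standard form (coefficient names chosen here for illustration): with (r, theta, phi) the spherical coordinates of the evaluation point relative to the expansion center and Y_n^m the spherical harmonics,

$$
\Phi \;\approx\; \sum_{n=0}^{p-1} \sum_{m=-n}^{n} C_n^m\, r^{n}\, Y_n^m(\theta,\varphi)
\quad \text{(local expansion, small } r\text{)},
\qquad
\Phi \;\approx\; \sum_{n=0}^{p-1} \sum_{m=-n}^{n} D_n^m\, r^{-(n+1)}\, Y_n^m(\theta,\varphi)
\quad \text{(multipole expansion, large } r\text{)}.
$$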

9 FMM flow chart
1. Build data structures
2. Initial M-expansions
3. Upward M|M translations
4. Downward M|L and L|L translations
5. L-expansions
6. Local direct sum (P2P) and final summation
From a Java animation of the FMM by Y. Wang, M.S. Thesis, UMD 2005

10 Novel parallel algorithm for FMM data structures
Data structures for assigning points to boxes, finding neighbor lists, and retaining only non-empty boxes
Usual procedures use a sort and have O(N log N) cost
Present algorithm: parallelizable on the GPU with O(N) cost
- Modified parallel counting sort with linear cost (see the sketch below)
- Histograms: counters of particles inside spatial boxes
- Parallel scan: performs the reduction operations
- Cost is significantly below the cost of an FMM evaluation step
Data structures passed to the kernel evaluation engine are compact, i.e. contain no empty-box related structures
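A minimal numpy sketch of the histogram-plus-scan idea (CPU code for clarity; the box indexing, names, and unit-cube assumption are illustrative, not the authors' data layout):

```python
import numpy as np

def box_indices(points, level):
    # Quantize positions in the unit cube to integer box coordinates at this octree level
    nside = 2 ** level
    ijk = np.minimum((points * nside).astype(np.int64), nside - 1)
    return (ijk[:, 0] * nside + ijk[:, 1]) * nside + ijk[:, 2]  # simple linear box index

pts = np.random.rand(100000, 3)                         # particles in the unit cube
box = box_indices(pts, level=3)
counts = np.bincount(box, minlength=8 ** 3)             # histogram: particles per box
starts = np.concatenate(([0], np.cumsum(counts)[:-1]))  # exclusive scan -> box offsets
nonempty = np.flatnonzero(counts)                       # only non-empty boxes are retained
```

On the GPU the histogram is built with atomic counters and the exclusive scan is a standard parallel primitive, so the whole construction stays O(N).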

11 Performance
[Plot: FMM data-structure construction time vs. depth of the FMM octree (levels)]
FMM data structures built on the GPU for millions of particles in 0.1 s, as opposed to the 2-10 s required on the CPU
Substantial computational savings for dynamic problems, where particle positions change and the data structures need to be regenerated at each time step

12 Heterogeneous architecture

13 Mapping the FMM onto the CPU/GPU architecture
The GPU is a highly parallel, multithreaded, many-core processor
- Good for repetitive operations on multiple data (SIMD)
CPUs are good for complex tasks with
- Complicated data structures, such as the FMM M|L translation stencils, with complicated memory access patterns
CPU-GPU communication is expensive
Profile the FMM and determine which parts go where
[Diagram: CPU with a few cores, large cache, and control logic vs. GPU with hundreds of cores]

14 FMM on the GPU
Look at the implementation of Gumerov & Duraiswami (2008)
- M|L translation cost: 29%; GPU speedup 1.6x
- Local direct sum: 66.1%; GPU speedup 90.1x
The profiling data suggest
- Perform translations on the CPU: multicore parallelization and the large cache provide comparable or better performance
- The GPU computes the local direct sum (P2P) and other particle-related work: SIMD (see the sketch below)
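A minimal sketch of the resulting overlap pattern, using Python threads; the two function bodies below are placeholders (plain numpy arithmetic), not the paper's translation code or GPU kernels:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def translations_on_cpu(multipole_coeffs):
    # Stand-in for the M|M, M|L, L|L translation pass done on the multicore CPU
    return 0.5 * multipole_coeffs

def local_direct_sum_on_gpu(sources, receivers):
    # Stand-in for the P2P kernel launched on the GPU (here: all-pairs 1/r potential)
    d = np.linalg.norm(receivers[:, None, :] - sources[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                    # exclude self-interaction
    return (1.0 / d).sum(axis=1)

pts = np.random.rand(1000, 3)
coeffs = np.random.rand(512, 16)
with ThreadPoolExecutor(max_workers=2) as pool:    # CPU work and GPU work run concurrently
    far_future = pool.submit(translations_on_cpu, coeffs)
    near_future = pool.submit(local_direct_sum_on_gpu, pts, pts)
    far, near = far_future.result(), near_future.result()
```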

15 Single node algorithm

16 Advantages
The CPU and GPU are each tasked with their most efficient jobs
- Faster translations: the CPU code can be better optimized using the complex translation stencil data structures
- High-accuracy, double-precision CPU translations without much cost penalty
- Faster local direct sum: many cores on the GPU; the same kernel evaluation applied to multiple data (SIMD)
The CPU is not idle during the GPU computations
Easy visualization: all particles reside on the GPU
Smaller data transfers between the CPU and GPU

17 GPU Visualization and Steering

18 Single node tests
Dual quad-core Intel Nehalem processors
24 GB of RAM
Two Tesla C1060 GPUs

19 Dividing the FMM algorithm over different nodes
Divide the domain and assign each piece to a separate node (work regions)
Use the linearity of translations and the spatial decomposition property of the FMM to perform the algorithm correctly
[Diagram: domain split into four work regions assigned to Nodes 0-3, with a target box highlighted]
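As an illustration only (this is not the authors' exact load-balancing scheme, which the master node applies to receiver boxes), splitting an ordered list of non-empty boxes into contiguous, roughly equal-weight work regions might look like:

```python
import numpy as np

def work_regions(nonempty_boxes, box_counts, num_nodes):
    # Split the ordered non-empty boxes into num_nodes contiguous chunks
    # with roughly equal particle counts -- one work region per node.
    weights = box_counts[nonempty_boxes]
    cum = np.cumsum(weights)
    targets = cum[-1] * np.arange(1, num_nodes) / num_nodes
    cuts = np.searchsorted(cum, targets)
    return np.split(nonempty_boxes, cuts)

counts = np.random.poisson(5.0, size=8 ** 3)           # fake particle counts per box
nonempty = np.flatnonzero(counts)
regions = work_regions(nonempty, counts, num_nodes=4)  # one box list per node
```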

20 The algorithm flow chart
The master collects receiver boxes and distributes work regions (load balance)
Particle data are assigned according to the assigned work regions
M|M translations for local non-empty receiver boxes, while M|L and L|L translations are for global non-empty receiver boxes
L-coefficients are efficiently sent to the master node in binary-tree order
[Flow chart, Node K: ODE solver (update source/receiver positions, etc.); data structure (receivers); merge; data structure (sources); assign particles to nodes; single heterogeneous node algorithm; final sum; exchange final L-expansions]

21 Scalability issues
M|M and M|L translations are distributed among all nodes
Local direct sums are not repeated
L|L translations are repeated
- Normally, M|L translations take 90% of the overall time and the L|L translation cost is negligible
- Amdahl's Law: the repeated part limits overall performance when the number of nodes P is large
- Still efficient for small clusters (1~64 nodes)
The current fully scalable algorithm performs distributed L|L translations
- Further divides boxes into four categories
- A much better solution uses our recent multiple-node data structures and algorithms (Hu et al., submitted)
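For reference, the standard statement of Amdahl's Law that the bullet above appeals to (the symbols f and P are named here for illustration):

$$
S(P) \;=\; \frac{1}{(1-f) + f/P},
$$

where P is the number of nodes and f is the fraction of work that is actually distributed; the replicated L|L translations contribute to the (1-f) term, so the speedup saturates at 1/(1-f) no matter how many nodes are added.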

22 Weak scalability
Fix 8M particles per node
Run tests on 1~16 nodes
The depth of the octree determines the overhead
The particle density determines the parallel-region timing

23 Strong scalability
Fix the problem size at 8M particles
Run tests on 1~16 nodes
The direct sum dominates the computation cost
- Unless the GPU is fully occupied, the algorithm does not achieve strong scalability
- Can choose the number of GPUs per node according to the problem size

24 The billion-size test case
Using all 32 Chimera nodes and 64 GPUs
2^30 ≈ 1.07 billion particles; potential computation in 21.6 s
- 32M particles per node
Each node: dual quad-core Intel Nehalem processors, 24 GB of RAM, two Tesla C1060 GPUs

25 Performance count
SC'11 (Hu et al., 2011): FMM; problem size 1,073,741,824; 38 TFlops on 64 GPUs, 32 nodes; GPUs: Tesla C1060, 240 cores
SC'10 (Hamada and Nitadori, 2010): tree code; problem size 3,278,982,596; 190 TFlops on 576 GPUs, 144 nodes (342 TFlops on 576 GPUs); GPUs: GTX 295, 2 x 240 cores
SC'09 (Hamada et al., 2009): tree code; problem size 1,608,044,129; TFlops on 256 GPUs, 128 nodes; GPUs: GeForce 8800 GTS, 96 cores

26 Conclusion
Heterogeneous scalable (CPU/GPU) FMM for single nodes and clusters
Scalability of the algorithm is tested and satisfactory results are obtained for midsize heterogeneous clusters
Novel algorithm for FMM data structures on GPUs
- Fixes a bottleneck for large dynamic problems
The developed code will be used in solvers for many large-scale problems in aeromechanics, astrophysics, molecular dynamics, etc.

27 Questions? Acknowledgments

28 Backup Slides

29 Accuracy test
Test performed for the potential kernel
NVIDIA Tesla C2050 GPU accelerator with 3 GB memory
Intel Xeon E5504 processor at 2.00 GHz with 6 GB RAM

30 Bucket-sort on the GPU
Source/receiver data points are stored in an array
Each data point i carries a 2D vector: its box index and its local rank within that box
Each box j has a counter, incremented (atomically) as points are assigned to it; the value read back gives the point's local rank
A parallel scan of the box counters gives each box's starting offset
The rank of data point i = starting offset of its box + its local rank
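A sketch of this ranking step in plain Python/numpy (a sequential loop stands in for the GPU's atomic counter updates; names are illustrative, and this mirrors the histogram-and-scan sketch given earlier):

```python
import numpy as np

rng = np.random.default_rng(0)
box = rng.integers(0, 64, size=1000)        # box index of each source/receiver point

counter = np.zeros(64, dtype=np.int64)      # one counter per box
local_rank = np.empty_like(box)
for i, b in enumerate(box):                 # atomicAdd on the GPU; a plain loop here
    local_rank[i] = counter[b]              # 2D vector for point i: (box[i], local_rank[i])
    counter[b] += 1

starts = np.concatenate(([0], np.cumsum(counter)[:-1]))   # exclusive scan of box counters
rank = starts[box] + local_rank                            # global rank of each point
assert np.array_equal(np.sort(rank), np.arange(len(box)))  # the ranks form a permutation
```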

31 Parallel scan operation
Given an array a = [a_0, a_1, ..., a_{n-1}], compute the prefix sums s_i = a_0 + a_1 + ... + a_{i-1} (an exclusive scan, with s_0 = 0); an inclusive scan includes a_i in s_i. This can be done in parallel in O(log n) steps.
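A compact illustration of the log-step (Hillis-Steele) inclusive scan pattern, written with numpy array operations so that each pass corresponds to one data-parallel step (this is a generic scan, not the authors' specific GPU kernel):

```python
import numpy as np

def inclusive_scan(a):
    # log2(n) passes; in pass d, element i adds the element d positions to its left.
    # Each pass reads the previous pass's array, as a separate GPU kernel launch would.
    s = np.asarray(a, dtype=np.int64).copy()
    d = 1
    while d < len(s):
        shifted = np.concatenate((np.zeros(d, dtype=s.dtype), s[:-d]))
        s = s + shifted
        d *= 2
    return s

a = np.array([3, 1, 7, 0, 4, 1, 6, 3])
print(inclusive_scan(a))   # [ 3  4 11 11 15 16 22 25]
```

The exclusive scan used for the box offsets above is just this result shifted right by one position with a leading zero.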