OpenFOAM on a GPU-based Heterogeneous Cluster


OpenFOAM on a GPU-based Heterogeneous Cluster
Rajat Phull, Srihari Cadambi, Nishkam Ravi and Srimat Chakradhar
NEC Laboratories America, Princeton, New Jersey, USA. www.nec-labs.com

OpenFOAM Overview
OpenFOAM stands for 'Open Field Operations And Manipulation'. It consists of a library of efficient CFD-related C++ modules, which can be combined to create solvers and utilities (for example pre/post-processing, mesh checking, manipulation, and conversion).

OpenFOAM Application Domains: Examples
Buoyancy-driven (temperature-driven) flow; fluid-structure interaction. Its modeling capabilities are used by the aerospace, automotive, biomedical, energy and processing industries.

OpenFOAM on a CPU Cluster
Domain decomposition: the mesh and its associated fields are decomposed across MPI processes, e.g. using the Scotch partitioner.
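In OpenFOAM, the decomposition method is chosen in the `system/decomposeParDict` dictionary read by the `decomposePar` utility. A minimal sketch selecting the Scotch partitioner (the standard FoamFile header is omitted; the subdomain count here is illustrative):

```
// system/decomposeParDict (minimal sketch)
numberOfSubdomains  4;
method              scotch;  // graph-based partitioner, needs no geometric hints
```

Scotch balances the cell count per subdomain while minimizing the number of faces on processor boundaries, i.e. the inter-process communication volume.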

Motivation for a GPU-based Cluster
OpenFOAM solver on a CPU-based cluster, where each node has a quad-core 2.4 GHz processor and 48 GB RAM: performance degrades as the data size increases.

This Paper
Ported a key OpenFOAM solver to CUDA and compared the performance of OpenFOAM solvers on CPU- and GPU-based clusters: around 4x faster on the GPU-based cluster. Addressed the imbalance caused by different GPU generations in the cluster with a run-time analyzer that dynamically load-balances the computation by repartitioning the input data.

How We Designed the Framework
Profiled representative workloads to identify the computational bottlenecks. Implemented a CUDA version for the clustered application. Handled the imbalance due to different generations of GPUs, or nodes without GPUs, by load balancing the computation through repartitioning of the input data.

InterFOAM Application Profiling
Profiling identifies the computational bottleneck. Straightforwardly porting only the bottleneck to the GPU adds a data transfer in every iteration; to avoid per-iteration transfers, port at a higher granularity by moving the entire solver onto the GPU.

PCG Solver
The preconditioned conjugate gradient (PCG) method is an iterative algorithm for solving linear systems Ax = b. Each iteration updates the solution vector x and the residual r, and r is checked for convergence.
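The PCG loop described on the slide can be sketched in NumPy with a diagonal (Jacobi) preconditioner, matching the diagonal preconditioning used in the paper's port. This is a generic textbook sketch, not the paper's CUDA implementation:

```python
import numpy as np

def pcg(A, b, x0=None, tol=1e-8, max_iter=1000):
    """Diagonally preconditioned conjugate gradient for SPD A.

    Solves Ax = b; each iteration updates x and the residual r,
    and r is checked for convergence (as described on the slide).
    """
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.astype(float)
    r = b - A @ x                      # initial residual
    M_inv = 1.0 / np.diag(A)           # diagonal (Jacobi) preconditioner
    z = M_inv * r
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p                     # SpMV: the step cuSPARSE does on the GPU
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:    # convergence check on the residual
            break
        z = M_inv * r                  # apply preconditioner
        rz_new = r @ z
        p = z + (rz_new / rz) * p      # new search direction
        rz = rz_new
    return x
```

In the GPU port, the dot products and vector updates map to cuBLAS calls and the matrix-vector product to cuSPARSE, so the whole loop stays on the device.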

InterFOAM on a GPU-based Cluster
Convert the input matrix A from LDU to CSR format, then transfer A, x0 and b to GPU memory. A custom kernel performs the diagonal preconditioning; cuBLAS APIs handle the linear algebra operations and cuSPARSE the matrix-vector multiplication. Inter-process communication requires intermediate vectors in host memory; scatter and gather kernels reduce this data transfer. The loop repeats until convergence, after which the solution vector x is transferred back to host memory.
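The LDU-to-CSR conversion can be sketched as follows. OpenFOAM's lduMatrix stores the diagonal plus one coefficient per mesh face for each triangle (lower/upper), with owner/neighbour face addressing; the array names below follow that convention, but this is an illustrative NumPy sketch, not the paper's kernel:

```python
import numpy as np

def ldu_to_csr(diag, lower, upper, owner, neighbour, n):
    """Assemble LDU storage into CSR arrays (vals, cols, indptr).

    diag[i]  -> A[i, i]
    upper[f] -> A[owner[f], neighbour[f]]   (face f couples two cells)
    lower[f] -> A[neighbour[f], owner[f]]
    """
    rows = np.concatenate([np.arange(n), neighbour, owner])
    cols = np.concatenate([np.arange(n), owner, neighbour])
    vals = np.concatenate([diag, lower, upper])
    order = np.lexsort((cols, rows))            # sort entries row-major
    rows, cols, vals = rows[order], cols[order], vals[order]
    indptr = np.searchsorted(rows, np.arange(n + 1))  # row start offsets
    return vals, cols, indptr
```

CSR is the layout cuSPARSE's sparse matrix-vector routines expect, which is why the conversion happens once up front, before the iteration loop.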

Cluster with Identical GPUs: Experimental Results (I)

Time (s):

Problem Size | 1 Node (4 cores) | 2 Nodes (8 cores) | 3 Nodes (12 cores) | 1 Node (2 CPU cores + 2 GPUs) | 2 Nodes (4 CPU cores + 4 GPUs) | 3 Nodes (6 CPU cores + 6 GPUs)
159500  | 46    | 36    | 32    | 88   | 87   | 106
318500  | 153   | 85    | 70    | 146  | 142  | 165
637000  | 527   | 337   | 222   | 368  | 268  | 320
955000  | 1432  | 729   | 498   | 680  | 555  | 489
2852160 | 20319 | 11362 | 5890  | 4700 | 3192 | 2900
4074560 | 39198 | 19339 | 12773 | 7388 | 4407 | 4100

Each node: a quad-core Xeon, 2.4 GHz, 48 GB RAM + 2x NVIDIA Fermi C2050 GPUs with 3 GB RAM each.

Cluster with Identical GPUs: Experimental Results (II)
Performance comparison: the 3-node CPU cluster vs. 2 GPUs; among the GPU configurations, the 4-GPU cluster is optimal.

Cluster with Different GPUs
OpenFOAM employs task parallelism: the input data is partitioned and assigned to different MPI processes. When some nodes have no GPUs, or the GPUs have different compute capabilities, uniform domain decomposition of an iterative algorithm can lead to imbalance and suboptimal performance.

Heterogeneous Cluster: Suboptimal Performance of Iterative Methods
Iterative convergence algorithms create parallel tasks that communicate with each other. Suppose P0 and P1 have higher compute capability than P2 and P3. When the data is equally partitioned, performance is suboptimal: P0 and P1 complete their computations and then wait for P2 and P3 to finish.

The Case for Dynamic Data Partitioning on Heterogeneous Clusters
Runtime analysis + repartitioning.

Why not static partitioning based on the compute power of the nodes?
Predicting the optimal data partitioning is inaccurate, especially across GPUs with different memory bandwidths, cache hierarchies and numbers of processing elements. Multi-tenancy makes the prediction even harder, and a data-aware scheduling scheme (where the selection of computation to offload to the GPU happens at runtime) makes it more complex still.

How the Data Repartitioning System Works

Model for Imbalance Analysis: In the Context of OpenFOAM
Communication overhead is low, and remains insignificant even with unequal partitions. Each process P[node] holds a data ratio x[node] and reports a compute time T[node] (e.g. processes P[0]..P[3] with data ratios x[0]..x[3] and compute times T[0]..T[3]).

Weighted mean: tw = ∑ (T[node] * x[node]) / ∑ x[node]

If T[node] < tw, increase the partitioning ratio on P[node]; otherwise, decrease the partitioning ratio on P[node].
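One step of the weighted-mean rule above can be sketched as follows. The damping factor `step` is an assumption of this sketch (the paper does not specify how aggressively ratios are adjusted):

```python
import numpy as np

def repartition(ratios, times, step=0.05):
    """One adjustment step of the slide's imbalance rule.

    ratios[i]: fraction of the domain currently on process i (sums to 1)
    times[i] : measured compute time of process i over the last interval

    tw = sum(T[i] * x[i]) / sum(x[i]); processes faster than the weighted
    mean receive more data, slower ones less. `step` is an assumed
    damping factor, not taken from the paper.
    """
    ratios = np.asarray(ratios, dtype=float)
    times = np.asarray(times, dtype=float)
    tw = np.sum(times * ratios) / np.sum(ratios)   # weighted mean time
    new = np.where(times < tw,
                   ratios * (1 + step),            # faster: grow partition
                   ratios * (1 - step))            # slower: shrink partition
    return new / new.sum()                         # renormalize to 1
```

Applied repeatedly between solver phases, this drives the per-process compute times toward the weighted mean, which is what removes the idle waiting illustrated earlier.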

Data Repartitioning: Experimental Results

Average time per iteration (ms):

Problem Size | Workload equally balanced | Static partitioning | Dynamic repartitioning
159500  | 1.9   | -    | -
318500  | 2.4   | 2.2  | -
637000  | 2.85  | 2.7  | 2.35
955000  | 5.9   | 3.15 | 2.75
2852160 | 13.05 | 6.1  | 5.8
4074560 | 25.5  | 8.2  | 7.2

Node 1: 2 CPU cores + 2 Fermi C2050 GPUs. Node 2: 4 CPU cores. Node 3: 2 CPU cores + 2 Tesla C1060 GPUs.

Summary
Ported an OpenFOAM solver to a GPU-based heterogeneous cluster; the lessons extend to other solvers with similar characteristics (domain decomposition, iterative, sparse computations). For large problem sizes, we observed a speedup of around 4x on the GPU-based cluster. We identified the imbalance in GPU clusters caused by the fast evolution of GPUs, and proposed a run-time analyzer that dynamically load-balances the computation by repartitioning the input data.

Future Work
Scale up to a larger cluster and experiment with multi-tenancy in the cluster. Extend this work to incremental data repartitioning without restarting the application. Introduce a more sophisticated model for imbalance analysis to support a larger subset of applications.

Thank You!