Porting Irregular Reductions on Heterogeneous CPU-GPU Configurations. Xin Huo, Vignesh T. Ravi, Gagan Agrawal. Department of Computer Science and Engineering, The Ohio State University.

Presentation transcript:

Porting Irregular Reductions on Heterogeneous CPU-GPU Configurations Xin Huo, Vignesh T. Ravi, Gagan Agrawal Department of Computer Science and Engineering The Ohio State University

Irregular Reduction - Context
A "dwarf" in the Berkeley view on parallel computing:
– Unstructured grid pattern
– Random, irregular accesses
– Indirect memory references
Previous efforts in porting to different architectures:
– Distributed memory machines
– Distributed shared memory machines
– Shared memory machines
– Cache performance improvement on uniprocessors
– Many-core architectures - GPGPU (our work in ICS '11)
– No systematic study on a heterogeneous architecture (CPU & GPU)

Why CPU + GPU? - A Glimpse
CPUs and GPUs hold dominant positions in different areas, but are tightly connected:
– GPU: computation-intensive problems; large numbers of threads executing in SIMD
– CPU: data-intensive problems, branch-heavy processing, high-precision and complicated computation
– The GPU depends on the CPU for scheduling and data
One of the most popular heterogeneous architectures:
– 3 of the 5 fastest supercomputers are based on a CPU + GPU architecture (Top500 list, 11/2011)
– AMD Fusion, Intel Sandy Bridge, and NVIDIA Denver
– Cloud compute instances in Amazon

Outline
– Background
  – Irregular Reduction Structure
  – Partitioning-based Locking Scheme
  – Main Issues
– Contributions
– Multi-level Partitioning Framework
– Runtime Support
  – Pipelining Scheme
  – Task Scheduling
– Experimental Results
– Conclusions

Irregular Reduction Structure
Robj: reduction object, accessed through the indirection array (reduction space)
e: iteration of the computation loop; the loop iterates over e (computation space)
IA(e, x): indirection array entry x for iteration e

/* Outer Sequence Loop */
while( ) {
  /* Reduction Loop */
  Foreach(element e) {
    val1 = Process(IA(e,0));
    val2 = Process(IA(e,1));
    Robj(IA(e,0)) = Reduce(Robj(IA(e,0)), val1);
    Robj(IA(e,1)) = Reduce(Robj(IA(e,1)), val2);
  }
  /* Global Reduction to Combine Robj */
}
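The pattern above can be sketched concretely. The following is a minimal Python illustration (not the authors' code); `IA`, `Robj`, and the use of `+` as the associative, commutative `Reduce` operator follow the slide's notation, while `values[e]` stands in for `Process(...)`:

```python
# Minimal sketch of the irregular reduction pattern: each iteration e
# updates two reduction elements selected by the indirection array IA,
# so the memory access pattern is data-dependent and irregular.

def irregular_reduction(num_nodes, IA, values, num_steps=1):
    """Robj is the reduction space; IA[e] = (i, j) is the indirection array."""
    Robj = [0.0] * num_nodes
    for _ in range(num_steps):           # outer sequence loop
        for e, (i, j) in enumerate(IA):  # reduction loop over iterations e
            val = values[e]              # stand-in for Process(IA(e, x))
            Robj[i] += val               # Reduce: associative + commutative
            Robj[j] += val
    return Robj
```

Because only `IA` determines which elements of `Robj` are touched, two iterations may update the same element, which is what forces locking or partitioning when the loop is parallelized.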

Application Context: Molecular Dynamics
– Indirection array -> edges (interactions)
– Reduction objects -> molecules (attributes)
– Computation space -> interactions between molecules
– Reduction space -> attributes of molecules

Partitioning-based Locking Strategy
Reduction space partitioning:
– Efficient shared memory utilization
– Eliminates intra- and inter-block combination
Multi-dimensional partitioning method:
– Balances minimum edge cuts against partitioning time
(Huo et al., 25th International Conference on Supercomputing)

Main Issues
– Device memory limitation on the GPU
– Partitioning overhead:
  – Partitioning cost grows with data volume
  – The GPU idles while waiting for partitioning results
– Low CPU utilization:
  – The CPU only performs partitioning
  – The CPU idles while the GPU computes

Contributions
A novel multi-level partitioning framework:
– Parallelizes irregular reductions on a heterogeneous architecture (CPU + GPU)
– Eliminates the GPU device memory limitation
Runtime support scheme:
– Pipelining scheme
– Work-stealing-based scheduling strategy
Significant performance improvements:
– Exhaustive evaluations
– 11% and 22% improvement for Euler and Molecular Dynamics, respectively

Multi-level Partitioning Framework

Computation Space Partitioning
Partitioning on the iterations of the computation loop.
Pros:
– Load balance on computation
Cons:
– Unequal reduction size in each partition
– Replicated reduction elements (4 out of 16 nodes in the example)
– Combination cost:
  – Between CPU and GPU (first level)
  – Between different thread blocks (second level)
(Figure: example mesh divided into Partitions 1-4)
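The replication cost described above can be made concrete. The sketch below (hypothetical, not from the talk) splits the iteration space into equal contiguous slices and identifies the reduction elements touched by more than one slice; these are exactly the elements that must be replicated and later combined across partitions:

```python
# Hypothetical sketch of computation space partitioning: iterations are
# split into equal contiguous slices; any reduction element (node) touched
# by edges in more than one slice must be replicated and combined later.

def replicated_elements(IA, num_parts):
    chunk = (len(IA) + num_parts - 1) // num_parts  # iterations per partition
    owners = {}                    # node -> set of partitions touching it
    for e, (i, j) in enumerate(IA):
        p = e // chunk             # partition owning iteration e
        for node in (i, j):
            owners.setdefault(node, set()).add(p)
    return {n for n, parts in owners.items() if len(parts) > 1}
```

On a 4-edge cycle split two ways, two of the four nodes end up shared, which is the per-partition combination cost the slide warns about.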

Reduction Space Partitioning
Partitioning on the reduction elements.
Pros:
– Balanced reduction space:
  – Shared memory is feasible on the GPU
– Independence between any two partitions:
  – No communication between CPU and GPU (first level)
  – No communication between thread blocks (second level)
– Avoids combination cost between thread blocks and between CPU and GPU
Cons:
– Imbalance in the computation space
– Replicated work caused by crossing edges
(Figure: reduction elements divided into Partitions 1-4)
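The trade-off flips under this scheme: partitions become independent, but iterations whose endpoints land in different partitions (crossing edges) must run in every partition they touch. A small illustrative sketch (hypothetical names, not the authors' implementation):

```python
# Hypothetical sketch of reduction space partitioning: nodes are assigned
# to partitions up front (node_part); an iteration whose two endpoints
# fall in different partitions is a crossing edge and is replicated in
# each touched partition, so no inter-partition combination is needed.

def assign_and_count(IA, node_part):
    work = {}          # partition -> list of iterations it executes
    crossing = 0
    for e, (i, j) in enumerate(IA):
        parts = {node_part[i], node_part[j]}
        if len(parts) > 1:
            crossing += 1          # replicated work, not communication
        for p in parts:
            work.setdefault(p, []).append(e)
    return work, crossing
```

Each partition then updates only its own nodes, which is what makes GPU shared memory usable and removes the CPU-GPU combination step; the price is the replicated crossing-edge computation and possible computation imbalance.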

Task Scheduling Framework

Pipelining Scheme
– K blocks are assigned to the GPU in one global loading
– Partitioning and computation are pipelined within a global loading
Work Stealing Scheduling
– Scheduling granularity (large for GPU, small for CPU):
  – Too large: better pipelining effect, but worse load balance
  – Too small: better load balance, but shorter pipelining length
– Work stealing achieves both maximum pipelining length and good load balance
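The scheduling idea can be illustrated with a simplified, sequential simulation (this is a sketch, not the authors' runtime; chunk sizes and speeds are made-up parameters): both devices pull from one shared queue, the GPU in large chunks and the CPU in small ones, and whichever device frees up first takes the next chunk, so remaining work is effectively "stolen" by the faster finisher:

```python
# Simplified simulation of hybrid scheduling from a shared work queue:
# the GPU grabs large chunks, the CPU small ones, and the device that is
# free earliest takes the next chunk -- balancing load while keeping the
# GPU's chunks (and hence the pipeline) long.

from collections import deque

def schedule(num_blocks, gpu_chunk=4, cpu_chunk=1, gpu_speed=4.0, cpu_speed=1.0):
    queue = deque(range(num_blocks))
    done = {"gpu": [], "cpu": []}
    clock = {"gpu": 0.0, "cpu": 0.0}     # time at which each device is free
    while queue:
        dev = "gpu" if clock["gpu"] <= clock["cpu"] else "cpu"
        chunk = gpu_chunk if dev == "gpu" else cpu_chunk
        grabbed = [queue.popleft() for _ in range(min(chunk, len(queue)))]
        done[dev].extend(grabbed)
        speed = gpu_speed if dev == "gpu" else cpu_speed
        clock[dev] += len(grabbed) / speed
    return done
```

Every block is processed exactly once, with the split between devices emerging from their relative speeds rather than a static assignment.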

Experiment Setup
Platform:
– GPU: NVIDIA Tesla C2050 "Fermi" (14 x 32 = 448 cores), 2.68 GB device memory, 64 KB configurable shared memory
– CPU: Intel 2.27 GHz Quad Xeon E, GB memory
– x16 PCI Express 2.0
Applications:
– Euler (computational fluid dynamics)
– MD (molecular dynamics)

Scalability – Molecular Dynamics Scalability of Molecular Dynamics on Multi-core CPU and GPU across Different Datasets (MD)

Pipelining – Euler
Effect of pipelining CPU partitioning and GPU computation (EU):
– Performance increases as the pipelining length increases
– Partitioning overhead is hidden for all but the first partition

Heterogeneous Performance – EU and MD
Benefits from dividing computations between CPU and GPU: 11% improvement for EU and 22% for MD.

Work Stealing - Euler
Comparison of fine-grained, coarse-grained, and work-stealing strategies:
– Fine-grained (granularity = 1): good load balance, but bad pipelining effect
– Coarse-grained (granularity = 5): good pipelining effect, but bad load balance
– Work stealing: good pipelining effect and good load balance

Conclusions
– A multi-level partitioning framework for porting irregular reductions onto heterogeneous architectures
– The pipelining scheme overlaps partitioning on the CPU with computation on the GPU
– Work-stealing scheduling achieves the best pipelining effect and load balance

Thank you. Questions? Contacts: Xin Huo, Vignesh T. Ravi, Gagan Agrawal.