1 Porting Irregular Reductions on Heterogeneous CPU-GPU Configurations
Xin Huo, Vignesh T. Ravi, Gagan Agrawal
Department of Computer Science and Engineering, The Ohio State University

2 Irregular Reduction - Context
A dwarf in the Berkeley view on parallel computing:
– Unstructured grid pattern
– Random, irregular accesses
– Indirect memory references
Previous efforts in porting to different architectures:
– Distributed memory machines
– Distributed shared memory machines
– Shared memory machines
– Cache performance improvement on uniprocessors
– Many-core architectures (GPGPU, our work in ICS '11)
No systematic study on heterogeneous architectures (CPU & GPU)

3 Why CPU + GPU? - A Glimpse
Dominant in different areas, but tightly connected:
– GPU: computation-intensive problems; large numbers of threads executing in SIMD
– CPU: data-intensive problems, branch processing, high-precision and complicated computation
– The GPU depends on the CPU for scheduling and data
One of the most popular heterogeneous architectures:
– 3 of the 5 fastest supercomputers are based on a CPU + GPU architecture (Top500 list, 11/2011)
– Fusion from AMD, Sandy Bridge from Intel, and Denver from NVIDIA
– Cloud compute instances on Amazon

4 Outline
Background
– Irregular Reduction Structure
– Partitioning-based Locking Scheme
– Main Issues
Contributions
Multi-level Partitioning Framework
Runtime Support
– Pipelining Scheme
– Task Scheduling
Experiments
Conclusions

5 Irregular Reduction Structure
Robj: reduction object, accessed through the indirection array (reduction space)
e: iteration of the computation loop
IA(e,x): indirection array; IA iterates over e (computation space)

/* Outer Sequence Loop */
while( ) {
  /* Reduction Loop */
  Foreach(element e) {
    (IA(e,0), val1) = Process(IA(e,0));
    (IA(e,1), val2) = Process(IA(e,1));
    Robj(IA(e,0)) = Reduce(Robj(IA(e,0)), val1);
    Robj(IA(e,1)) = Reduce(Robj(IA(e,1)), val2);
  }
  /* Global Reduction to Combine Robj */
}
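The loop above can be sketched sequentially in Python; the edge list, `process`, and `reduce_op` below are hypothetical stand-ins for the application-specific parts:

```python
# Sketch of an irregular reduction: each edge e touches two reduction
# objects selected indirectly through the indirection array IA.
def irregular_reduction(IA, robj, process, reduce_op, iterations=1):
    for _ in range(iterations):          # outer sequence loop
        for e in range(len(IA)):         # reduction loop over edges
            i, j = IA[e]                 # indirect memory references
            robj[i] = reduce_op(robj[i], process(i, j))
            robj[j] = reduce_op(robj[j], process(j, i))
    return robj

# Example: count how many edges touch each node.
IA = [(0, 1), (1, 2), (2, 0)]
robj = [0, 0, 0]
irregular_reduction(IA, robj, process=lambda a, b: 1,
                    reduce_op=lambda acc, v: acc + v)
# each node participates in 2 edges, so robj == [2, 2, 2]
```

Because the writes go through IA, two iterations may update the same element of robj, which is why parallelization needs locking, replication, or partitioning.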

6 Application Context
Molecular Dynamics:
– Indirection array -> edges (interactions)
– Reduction objects -> molecules (attributes)
– Computation space -> interactions between molecules
– Reduction space -> attributes of molecules

7 Partitioning-based Locking Strategy
Reduction Space Partitioning
– Efficient shared memory utilization
– Eliminates intra- and inter-block combination
Multi-Dimensional Partitioning Method
– Balances minimizing cut edges against partitioning time
Huo et al., 25th International Conference on Supercomputing

8 Main Issues
Device memory limitation on the GPU
Partitioning overhead
– Partitioning cost grows with data volume
– The GPU idles while waiting for partitioning results
Low utilization of the CPU
– The CPU only performs partitioning
– The CPU idles while the GPU computes

9 Contributions
A novel Multi-level Partitioning Framework
– Parallelizes irregular reductions on a heterogeneous architecture (CPU + GPU)
– Eliminates the device memory limitation on the GPU
Runtime support
– Pipelining scheme
– Work-stealing-based scheduling strategy
Significant performance improvements
– Exhaustive evaluations
– 11% and 22% improvement for Euler and Molecular Dynamics, respectively

10 Multi-level Partitioning Framework

11 Computation Space Partitioning
Partitioning on the iterations of the computation loop
Pros
– Load balance on computation
Cons
– Unequal reduction size in each partition
– Replicated reduction elements (4 out of 16 nodes in the example)
– Combination cost
  Between CPU and GPU (first level)
  Between different thread blocks (second level)
[Figure: a 16-node mesh split into Partitions 1-4 by iterations; nodes 2, 4, 7, and 12 are replicated across partitions]
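A minimal Python sketch of this trade-off, with a hypothetical 4-edge mesh: splitting the edge list evenly balances computation, but reduction elements touched from more than one partition must be replicated and combined later.

```python
# Computation-space partitioning: split the edge list (iterations) evenly.
def partition_computation_space(IA, num_parts):
    size = (len(IA) + num_parts - 1) // num_parts
    return [IA[k * size:(k + 1) * size] for k in range(num_parts)]

# Reduction elements touched by more than one partition are replicated
# and require a combination step afterwards.
def replicated_elements(parts):
    seen, replicated = set(), set()
    for part in parts:
        touched = {n for edge in part for n in edge}
        replicated |= seen & touched
        seen |= touched
    return replicated

IA = [(0, 1), (1, 2), (2, 3), (3, 0)]
parts = partition_computation_space(IA, 2)
# nodes 0 and 2 are touched by both halves, so they need combination
print(sorted(replicated_elements(parts)))  # [0, 2]
```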

12 Reduction Space Partitioning
Partitioning on the reduction elements
Pros
– Balanced reduction space; shared memory is feasible on the GPU
– Partitions are mutually independent
  No communication between CPU and GPU (first level)
  No communication between thread blocks (second level)
– Avoids combination cost between thread blocks and between CPU and GPU
Cons
– Imbalance in the computation space
– Replicated work caused by crossing edges
[Figure: the same 16-node mesh split into Partitions 1-4 by reduction elements]
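The dual strategy can be sketched in Python on the same hypothetical 4-edge mesh: nodes are split instead of edges, and each partition processes every edge touching one of its nodes, so crossing edges are executed twice but no combination step is needed.

```python
# Reduction-space partitioning: split the reduction elements (nodes)
# into contiguous, balanced node sets.
def partition_reduction_space(num_nodes, num_parts):
    size = (num_nodes + num_parts - 1) // num_parts
    return [set(range(k * size, min((k + 1) * size, num_nodes)))
            for k in range(num_parts)]

# Each partition runs every edge that touches one of its own nodes and
# only writes its own nodes; crossing edges are replicated work.
def edges_per_partition(IA, node_parts):
    return [[e for e in IA if e[0] in nodes or e[1] in nodes]
            for nodes in node_parts]

IA = [(0, 1), (1, 2), (2, 3), (3, 0)]
node_parts = partition_reduction_space(4, 2)   # {0, 1} and {2, 3}
work = edges_per_partition(IA, node_parts)
total = sum(len(w) for w in work)
# crossing edges (1,2) and (3,0) run in both partitions:
print(total)  # 6 edge executions for 4 edges, i.e. 2 replicated
```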

13 Task Scheduling Framework

14 Pipelining Scheme
– K blocks are assigned to the GPU in one global loading
– Partitioning and computation are pipelined within a global loading
Work-Stealing Scheduling
– Scheduling granularity (large for the GPU; small for the CPU)
  Too large: better pipelining effect, but worse load balance
  Too small: better load balance, but shorter pipelining length
– Work stealing achieves both maximum pipelining length and good load balance
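A toy Python sketch of the granularity idea: partitions sit in a shared deque, a simulated GPU worker takes coarse chunks from the front while a simulated CPU worker steals fine-grained chunks from the back. The chunk sizes and single-threaded round-robin are illustrative assumptions, not the actual runtime.

```python
from collections import deque

# Work-stealing sketch: coarse chunks for the GPU preserve pipelining
# length; fine-grained steals by the CPU preserve load balance.
def schedule(num_blocks, gpu_chunk=4, cpu_chunk=1):
    queue = deque(range(num_blocks))
    gpu_work, cpu_work = [], []
    while queue:
        # GPU grabs a coarse chunk from the front of the queue
        for _ in range(min(gpu_chunk, len(queue))):
            gpu_work.append(queue.popleft())
        # CPU steals a fine-grained chunk from the back
        for _ in range(min(cpu_chunk, len(queue))):
            cpu_work.append(queue.pop())
    return gpu_work, cpu_work

gpu, cpu = schedule(10)
# every block is processed exactly once, split between the two devices
print(gpu, cpu)
```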

15 Experiment Setup
Platform
– GPU: NVIDIA Tesla C2050 "Fermi" (14 x 32 = 448 cores), 2.68 GB device memory, 64 KB configurable shared memory
– CPU: quad-core Intel Xeon E5520, 2.27 GHz, 48 GB memory, x16 PCI Express 2.0
Applications
– Euler (computational fluid dynamics)
– MD (molecular dynamics)

16 Scalability - Molecular Dynamics
Scalability of Molecular Dynamics (MD) on a multi-core CPU and a GPU across different datasets

17 Pipelining - Euler
Effect of pipelining CPU partitioning with GPU computation (EU)
– Performance increases as the pipelining length increases
– Partitioning overhead is hidden for all but the first partition

18 Heterogeneous Performance - EU and MD
Benefits from dividing computations between CPU and GPU: 11% improvement for EU and 22% for MD

19 Work Stealing - Euler
Comparison of fine-grained, coarse-grained, and work-stealing strategies
– Fine-grained (granularity = 1): good load balance, poor pipelining effect
– Coarse-grained (granularity = 5): good pipelining effect, poor load balance
– Work stealing: good pipelining effect and good load balance

20 Conclusions
– A Multi-level Partitioning Framework ports irregular reductions onto heterogeneous architectures
– The pipelining scheme overlaps partitioning on the CPU with computation on the GPU
– Work-stealing scheduling achieves the best pipelining effect and load balance

21 Thank you
Questions?
Contacts:
Xin Huo huox@cse.ohio-state.edu
Vignesh T. Ravi raviv@cse.ohio-state.edu
Gagan Agrawal agrawal@cse.ohio-state.edu

