Data-Parallel Programming using SaC, Lecture 4
F21DP Distributed and Parallel Technology
Sven-Bodo Scholz

Looking at Entire Programs:

Multi-Threaded Execution on SMPs (diagram: a data-parallel operation executed by several threads)

SMP Ideal: Coarse Grain!

Multi-Threaded Execution on GPGPUs (diagram: a data-parallel operation executed on the GPU)

GPGPU Ideal: Homogeneous Flat Coarse Grain

GPUs and synchronous memory transfers

GPGPU Naive Compilation

    PX = { [i,j] -> PX[ [i,j] ] + sum( VY[ [.,i] ] * CX[ [i,j] ]) };

is compiled into a CUDA kernel bracketed by explicit memory transfers:

    PX_d = host2dev( PX);
    CX_d = host2dev( CX);
    VY_d = host2dev( VY);
    PX_d = { [i,j] -> PX_d[ [i,j] ] + sum( VY_d[ [.,i] ] * CX_d[ [i,j] ]) };
    PX   = dev2host( PX_d);
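For readers who want to see this at the CUDA level, the sketch below shows one way the naive scheme could look. It is an illustration under assumptions, not the SaC compiler's actual output: the kernel px_step and its body are made up, and host2dev / dev2host are rendered as plain cudaMemcpy calls.

    #include <cuda_runtime.h>

    // Stand-in kernel: the real with-loop body is irrelevant here; what matters
    // is that every data-parallel operation pays for its own transfers.
    __global__ void px_step(float *px, const float *vy, const float *cx, int len)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < len)
            px[idx] += vy[idx] * cx[idx];
    }

    void naive_step(float *PX, const float *VY, const float *CX, int len)
    {
        size_t bytes = (size_t)len * sizeof(float);
        float *PX_d, *VY_d, *CX_d;
        cudaMalloc(&PX_d, bytes); cudaMalloc(&VY_d, bytes); cudaMalloc(&CX_d, bytes);

        cudaMemcpy(PX_d, PX, bytes, cudaMemcpyHostToDevice);   // PX_d = host2dev( PX)
        cudaMemcpy(CX_d, CX, bytes, cudaMemcpyHostToDevice);   // CX_d = host2dev( CX)
        cudaMemcpy(VY_d, VY, bytes, cudaMemcpyHostToDevice);   // VY_d = host2dev( VY)

        px_step<<<(len + 255) / 256, 256>>>(PX_d, VY_d, CX_d, len);  // the CUDA kernel

        cudaMemcpy(PX, PX_d, bytes, cudaMemcpyDeviceToHost);   // PX = dev2host( PX_d)
        cudaFree(PX_d); cudaFree(VY_d); cudaFree(CX_d);
    }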

Hoisting memory transfers out of loops

    for( i = 1; i < rep; i++) {
      PX_d = host2dev( PX);
      CX_d = host2dev( CX);
      VY_d = host2dev( VY);
      PX_d = { [i,j] -> PX_d[ [i,j] ] + sum( VY_d[ [.,i] ] * CX_d[ [i,j] ]) };
      PX   = dev2host( PX_d);

      PX_d = host2dev( PX);
      CX_d = host2dev( CX);
      VY_d = host2dev( VY);
      PX_d = { [i,j] -> PX_d[ [i,j] ] + sum( VY_d[ [.,i] ] * CX_d[ [i,j] ]) };
      PX   = dev2host( PX_d);
    }

Unrolling two iterations makes the redundancy visible: CX and VY are copied to the device again and again although they never change, and PX is copied back to the host only to be copied straight back to the device. All of these transfers can be hoisted out of the loop.

Retaining Arrays on the Device

If CX is itself computed on the device, the hoisted code still ships it back and forth across the kernel boundary:

    CX_d = { [i,j] -> ..... };
    CX   = dev2host( CX_d);
    ...
    PX_d = host2dev( PX);
    CX_d = host2dev( CX);
    VY_d = host2dev( VY);
    for( i = 1; i < rep; i++) { ...

Retaining CX on the device removes the round trip entirely:

    CX_d = { [i,j] -> ..... };
    ...
    PX_d = host2dev( PX);
    VY_d = host2dev( VY);
    for( i = 1; i < rep; i++) { ...
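Again as a hypothetical CUDA rendering (not actual compiler output), the hoisted and device-retained version of the loop could look as follows; px_step is the stand-in kernel from the sketch above and compute_cx is a made-up producer for CX.

    // CX is produced on the device and never copied to the host at all.
    __global__ void compute_cx(float *cx, int len)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < len)
            cx[idx] = 1.0f / (idx + 1);   // stand-in for CX_d = { [i,j] -> ..... }
    }

    void optimised_loop(float *PX, const float *VY, int len, int rep)
    {
        size_t bytes = (size_t)len * sizeof(float);
        float *PX_d, *VY_d, *CX_d;
        cudaMalloc(&PX_d, bytes); cudaMalloc(&VY_d, bytes); cudaMalloc(&CX_d, bytes);

        int blocks = (len + 255) / 256;
        compute_cx<<<blocks, 256>>>(CX_d, len);                 // CX stays on the device
        cudaMemcpy(PX_d, PX, bytes, cudaMemcpyHostToDevice);    // host2dev( PX), hoisted
        cudaMemcpy(VY_d, VY, bytes, cudaMemcpyHostToDevice);    // host2dev( VY), hoisted

        for (int i = 1; i < rep; i++)                           // no transfers inside the loop
            px_step<<<blocks, 256>>>(PX_d, VY_d, CX_d, len);

        cudaMemcpy(PX, PX_d, bytes, cudaMemcpyDeviceToHost);    // dev2host( PX_d), once
        cudaFree(PX_d); cudaFree(VY_d); cudaFree(CX_d);
    }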

The Impact of Redundant Memory Transfers

Challenge: Memory Management: What does the λ-calculus teach us?

    f( a, b, c)
    f( a, b, c) { ... a ..... a .... b ..... b .... c ... }

Every occurrence of a, b, and c in the body is a conceptual copy of the corresponding argument.

How do we implement this? – the scalar case

    operation | implementation
    ----------+--------------------
    read      | read from stack
    funcall   | push copy on stack

How do we implement this? – the non-scalar case, naive approach

    operation | non-delayed copy
    ----------+----------------------
    read      | O(1) + free
    update    | O(1)
    reuse     |
    funcall   | O(1) / O(n) + malloc

How do we implement this? – the non-scalar case, widely adopted approach (garbage collection)

    operation | delayed copy + delayed GC
    ----------+--------------------------
    read      | O(1)
    update    | O(n) + malloc
    reuse     | malloc
    funcall   |

How do we implement this? – the non-scalar case, reference counting approach

    operation | delayed copy + non-delayed GC
    ----------+------------------------------
    read      | O(1) + DEC_RC_FREE
    update    | O(1) / O(n) + malloc
    reuse     | O(1) / malloc
    funcall   | O(1) + INC_RC
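As a concrete picture of this scheme, here is a minimal sketch of a reference-counted array descriptor. It is an assumption-laden illustration, not SaC's actual runtime representation: the names Array, inc_rc, dec_rc_free and arr_update are invented for the example.

    #include <cstddef>
    #include <cstring>

    struct Array { int rc; size_t n; double *elems; };

    static Array *arr_alloc(size_t n) {
        return new Array{1, n, new double[n]};             // fresh array, one reference
    }

    static void inc_rc(Array *a) { a->rc++; }              // INC_RC: O(1)

    static void dec_rc_free(Array *a) {                    // DEC_RC_FREE: non-delayed GC
        if (--a->rc == 0) { delete[] a->elems; delete a; }
    }

    // Functional update a[i] = v: O(1) in-place reuse if we hold the only
    // reference, otherwise an O(n) copy plus allocation.
    static Array *arr_update(Array *a, size_t i, double v) {
        if (a->rc == 1) { a->elems[i] = v; return a; }     // reuse
        Array *b = arr_alloc(a->n);
        std::memcpy(b->elems, a->elems, a->n * sizeof(double));
        a->rc--;
        b->elems[i] = v;
        return b;
    }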

How do we implement this? – the non-scalar case, a comparison of approaches

    operation | non-delayed copy     | delayed copy + delayed GC | delayed copy + non-delayed GC
    ----------+----------------------+---------------------------+------------------------------
    read      | O(1) + free          | O(1)                      | O(1) + DEC_RC_FREE
    update    | O(1)                 | O(n) + malloc             | O(1) / O(n) + malloc
    reuse     |                      | malloc                    | O(1) / malloc
    funcall   | O(1) / O(n) + malloc |                           | O(1) + INC_RC

Avoiding Reference Counting Operations

    b = a[1];
    c = f( a, 1);
    d = a[2];
    e = f( a, 2);

Clearly, we can avoid RC operations for the plain reads of a (b = a[1] and d = a[2]). We would like to avoid them for the function calls as well, BUT we cannot: without knowing what f does with its argument, the calls still need RC operations.
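One possible shape of the generated code for this snippet, reusing the inc_rc / dec_rc_free sketch from above, is shown below. It is a hypothetical rendering under the assumption that a callee is handed its own reference and releases it; it is not what the SaC compiler literally emits.

    Array *f(Array *a, int i);        // some separately compiled function

    void example(Array *a) {          // caller owns one reference to a
        double b = a->elems[1];       // plain read: a is obviously still live,
                                      // so no rc-op is needed here
        inc_rc(a);                    // a is used again below, so the call
        Array *c = f(a, 1);           // must be given its own reference
        double d = a->elems[2];       // plain read again: no rc-op needed
        inc_rc(a);                    // we would like to drop the rc-ops around
        Array *e = f(a, 2);           // the calls too, but without knowledge of
        dec_rc_free(a);               // what f does with a they have to stay
        dec_rc_free(c);               // release the results when done
        dec_rc_free(e);
    }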

NB: Why don’t we have RC-world-domination? (figure: an object graph annotated with reference counts 1, 1, 2, 3, 3)

Going Multi-Core (diagram: single-threaded execution with rc-ops, forking into a data-parallel section)
- local variables do not escape!
- relatively free variables can only benefit from reuse in 1/n cases!
=> use thread-local heaps
=> inhibit rc-ops on relatively free variables

Bi-Modal RC (diagram: at the fork of a data-parallel section, thread-local variables are counted in "local" mode while relatively free variables are switched to "norc"; normal counting resumes at the join)
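A minimal sketch of the bi-modal idea is given below, under the assumption that every descriptor carries an explicit RC mode; the names and the representation are invented for the illustration and do not claim to match SaC's runtime.

    #include <cstddef>

    enum RcMode { RC_LOCAL, RC_NORC };

    struct ModalArray { RcMode mode; int rc; size_t n; double *elems; };

    static void inc_rc(ModalArray *a) {
        if (a->mode == RC_LOCAL)      // thread-local data: plain, non-atomic counting
            a->rc++;
        // RC_NORC: relatively free data is guaranteed to outlive the
        // data-parallel section, so the rc-op is simply suppressed.
    }

    static void dec_rc_free(ModalArray *a) {
        if (a->mode == RC_LOCAL && --a->rc == 0) {
            delete[] a->elems;
            delete a;
        }
    }

    // At the fork the runtime would switch relatively free arrays to RC_NORC;
    // at the join they are switched back and counted normally again.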

Conclusions
There are still many challenges ahead, e.g.:
- non-array data structures
- arrays on clusters
- joining data and task parallelism
- better memory management
- application studies
If you are interested in joining the team: talk to me!

Going Multi-Core II (diagram: single-threaded execution with rc-ops, spawning a task-parallel section with rc-ops in both tasks)
- local variables do escape!
- relatively free variables can benefit from reuse in 1/2 cases!
=> use locking ...
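For the task-parallel case, a sketch of what the synchronised counter updates could look like is shown below; it uses std::atomic as one concrete realisation of the "locking" the slide mentions, which is an assumption on my part rather than a statement about SaC's implementation.

    #include <atomic>
    #include <cstddef>

    struct SharedArray { std::atomic<int> rc; size_t n; double *elems; };

    static void inc_rc(SharedArray *a) {
        a->rc.fetch_add(1, std::memory_order_relaxed);    // concurrent INC_RC
    }

    static void dec_rc_free(SharedArray *a) {
        // Acquire-release ordering makes all writes to the array visible to the
        // thread that ends up freeing it.
        if (a->rc.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            delete[] a->elems;
            delete a;
        }
    }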

Going Many-Core
256 cores, 500 threads in hardware each; a functional programmer's paradise, no?!
Nested data-parallelism and task-parallelism.

RC in Many-Core Times (diagram: computational thread(s) hand their rc-ops over to a dedicated RC-thread)
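The diagram suggests off-loading rc-ops to a dedicated thread. Below is a deliberately simple sketch of that idea, reusing the Array descriptor from the earlier sketch. Everything here is an assumption for illustration: a real implementation would presumably use per-thread buffers or lock-free queues rather than one mutex-protected queue, and shutdown handling is omitted.

    #include <condition_variable>
    #include <mutex>
    #include <queue>

    struct RcRequest { Array *target; int delta; };     // +1 = INC_RC, -1 = DEC_RC

    static std::queue<RcRequest> rc_queue;
    static std::mutex rc_mutex;
    static std::condition_variable rc_cv;

    // Called by computational threads: they never touch the counter themselves.
    static void post_rc_op(Array *a, int delta) {
        { std::lock_guard<std::mutex> lock(rc_mutex); rc_queue.push({a, delta}); }
        rc_cv.notify_one();
    }

    // Body of the dedicated RC-thread: it is the only thread that mutates the
    // counters, so the counts themselves need no synchronisation.
    static void rc_thread_main() {
        std::unique_lock<std::mutex> lock(rc_mutex);
        for (;;) {
            rc_cv.wait(lock, [] { return !rc_queue.empty(); });
            RcRequest r = rc_queue.front(); rc_queue.pop();
            lock.unlock();                                // apply outside the queue lock
            r.target->rc += r.delta;
            if (r.target->rc == 0) { delete[] r.target->elems; delete r.target; }
            lock.lock();
        }
    }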

And here are the runtimes: (runtime chart)

Multi-Modal RC (diagram: spawn of a task-parallel section)

New runtimes: (runtime chart)