Waves!

Solving something like this…

The Wave Equation (1-D) (n-D)
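
For reference, the standard forms, writing y(x, t) for the displacement and c for the wave speed (c is an assumed symbol, not from the slides):

    % 1-D wave equation
    \frac{\partial^2 y}{\partial t^2} = c^2 \frac{\partial^2 y}{\partial x^2}

    % n-D wave equation
    \frac{\partial^2 y}{\partial t^2} = c^2 \nabla^2 y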

The Wave Equation
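
A standard discretization (and presumably what the pseudocode's f(x, t+1) computes) is the second-order centered-difference update, with timestep Δt and node spacing Δx:

    y(x, t+1) = 2\,y(x, t) - y(x, t-1)
              + \left(\frac{c\,\Delta t}{\Delta x}\right)^{2}
                \bigl[\, y(x+1, t) - 2\,y(x, t) + y(x-1, t) \,\bigr]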

Boundary Conditions
Examples:
– Manual motion at an end: u(0, t) = f(t)
– Bounded ends: u(0, t) = u(L, t) = 0 for all t

Discrete solution
Deal with three states at a time (all positions at t−1, t, t+1)
Let L = number of nodes (distinct discrete positions)
– Create a 2D array of size 3*L
– Denote pointers to where each region begins
– Cyclically overwrite regions you don’t need!

Discrete solution
Sequential pseudocode:

    fill data array with initial conditions
    for all times t = 0 … t_max
        point old_data pointer where current_data used to be
        point current_data pointer where new_data used to be
        point new_data pointer where old_data used to be
            (so that we can overwrite the old!)
        for all positions x = 1 … (number of nodes − 2)
            calculate f(x, t+1)
        end
        set any boundary conditions!
        (every so often, write results to file)
    end
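
A minimal CPU sketch of this loop in C/C++, assuming bounded ends, a single bump as the initial condition, and an illustrative coefficient r2 = (c·Δt/Δx)² (the constants and names are assumptions, not from the slides):

    // Sequential sketch: one buffer of 3*L floats, three rotating region pointers.
    #include <stdlib.h>

    int main(void) {
        const int L = 1000;            // number of nodes (illustrative)
        const int tMax = 10000;        // number of timesteps (illustrative)
        const float r2 = 0.25f;        // (c*dt/dx)^2, must be <= 1 for stability
        float *data = (float *)calloc(3 * L, sizeof(float));
        float *old_data = data, *current_data = data + L, *new_data = data + 2 * L;

        // Initial conditions: a bump with zero initial velocity. The loop rotates
        // pointers at the top, so y(t = -1) goes where current_data starts and
        // y(t = 0) goes where new_data starts.
        current_data[L / 2] = 1.0f;
        new_data[L / 2] = 1.0f;

        for (int t = 0; t < tMax; t++) {
            // cycle the three region pointers
            float *tmp = old_data;
            old_data = current_data;
            current_data = new_data;
            new_data = tmp;            // this region gets overwritten below

            for (int x = 1; x <= L - 2; x++)
                new_data[x] = 2.0f * current_data[x] - old_data[x]
                            + r2 * (current_data[x + 1] - 2.0f * current_data[x]
                                    + current_data[x - 1]);

            // boundary conditions: bounded ends
            new_data[0] = 0.0f;
            new_data[L - 1] = 0.0f;
            // (every so often, write results to file)
        }
        free(data);
        return 0;
    }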

GPU Algorithm – Kernel
(Assume kernel launched at some time t…)
Calculate y(x, t+1)
– Each thread handles only a few values of x!
– Similar to the polynomial problem
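
A minimal CUDA kernel sketch for this step, using a grid-stride loop so each thread handles a few values of x; the parameter names and the coefficient r2 = (c·Δt/Δx)² are illustrative, not from the slides:

    // Each thread computes y(x, t+1) for a few values of x (grid-stride loop).
    __global__ void waveStepKernel(const float *old_data,
                                   const float *current_data,
                                   float *new_data,
                                   int numNodes, float r2) {
        for (int x = blockIdx.x * blockDim.x + threadIdx.x + 1;
             x < numNodes - 1;
             x += blockDim.x * gridDim.x) {
            new_data[x] = 2.0f * current_data[x] - old_data[x]
                        + r2 * (current_data[x + 1] - 2.0f * current_data[x]
                                + current_data[x - 1]);
        }
    }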

GPU Algorithm – The Wrong Way
Recall the old “GPU computing instructions”:
– Setup inputs on the host (CPU-accessible memory)
– Allocate memory for inputs on the GPU
– Copy inputs from host to GPU
– Allocate memory for outputs on the host
– Allocate memory for outputs on the GPU
– Start GPU kernel
– Copy output from GPU to host

GPU Algorithm – The Wrong Way
Sequential pseudocode:

    fill data array with initial conditions
    for all times t = 0 … t_max
        adjust old_data pointer
        adjust current_data pointer
        adjust new_data pointer
        allocate memory on the GPU for old, current, new
        copy old, current data from CPU to GPU
        launch kernel
        copy new data from GPU to CPU
        free GPU memory
        set any boundary conditions!
        (every so often, write results to file)
    end

GPU Algorithm – The Wrong Way Insidious memory transfer! Many memory allocations!

GPU Algorithm – The Right Way
Sequential pseudocode:

    allocate memory on the GPU for old, current, new
    either: create initial conditions on CPU, copy to GPU
        or: calculate and/or set initial conditions on the GPU!
    for all times t = 0 … t_max
        adjust old_data address
        adjust current_data address
        adjust new_data address
        launch kernel with the above addresses
        either: set boundary conditions on CPU, copy to GPU
            or: calculate and/or set boundary conditions on the GPU
    end
    free GPU memory

GPU Algorithm – The Right Way Everything stays on the GPU all the time! – Almost…

Getting output
What if we want to get a “snapshot” of the simulation?
– That’s when we do a GPU-to-CPU copy!

GPU Algorithm – The Right Way
Sequential pseudocode:

    allocate memory on the GPU for old, current, new
    either: create initial conditions on CPU, copy to GPU
        or: calculate and/or set initial conditions on the GPU!
    for all times t = 0 … t_max
        adjust old_data address
        adjust current_data address
        adjust new_data address
        launch kernel with the above addresses
        either: set boundary conditions on CPU, copy to GPU
            or: calculate and/or set boundary conditions on the GPU
        every so often, copy from GPU to CPU and write to file
    end
    free GPU memory
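
A host-side sketch of “the right way”, using the waveStepKernel sketched above: cudaMalloc happens once, only device addresses rotate each timestep, and data crosses back to the CPU only for occasional snapshots (writeInterval and the buffer names are illustrative):

    #include <cuda_runtime.h>

    // h_initial: 3 * numNodes floats laid out as old | current | new. Because the
    // loop rotates pointers at the top, put y(t = -1) in the middle region and
    // y(t = 0) in the last region.
    void solveWave(const float *h_initial, float *h_snapshot,
                   int numNodes, int tMax, float r2, int writeInterval) {
        float *d_data;
        cudaMalloc(&d_data, 3 * numNodes * sizeof(float));
        cudaMemcpy(d_data, h_initial, 3 * numNodes * sizeof(float),
                   cudaMemcpyHostToDevice);   // or set initial conditions in a kernel

        float *d_old = d_data, *d_cur = d_data + numNodes,
              *d_new = d_data + 2 * numNodes;
        const int threads = 256;
        const int blocks = (numNodes + threads - 1) / threads;

        for (int t = 0; t < tMax; t++) {
            // rotate device addresses -- no data movement, no reallocation
            float *tmp = d_old; d_old = d_cur; d_cur = d_new; d_new = tmp;

            waveStepKernel<<<blocks, threads>>>(d_old, d_cur, d_new, numNodes, r2);

            // boundary conditions: set on the GPU (tiny kernel) or copy a few values

            if (t % writeInterval == 0) {
                cudaMemcpy(h_snapshot, d_new, numNodes * sizeof(float),
                           cudaMemcpyDeviceToHost);
                // ... write h_snapshot to file ...
            }
        }
        cudaFree(d_data);
    }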

Multi-GPU Utilization
What if we want to use multiple GPUs…
– Within the same machine?
– Across a distributed system?
(Image: GPU cluster, CSIRO)

Recall our calculation: each new value y(x, t+1) is computed from y(x−1, t), y(x, t), y(x+1, t), and y(x, t−1).

Multiple GPU Solution Big idea: Divide our data array between n GPUs!

Multiple GPU Solution In reality: We have three regions of data at a time (old, current, new)

Multiple GPU Solution
Calculation for timestep t+1 uses the following data (diagram rows labeled t+1, t, t−1): the node’s own values at times t and t−1, plus its left and right neighbors at time t.

Multiple GPU Solution
Problem if we’re at the boundary of a process! (diagram: the stencil at position x reaches into data owned by the neighboring process)

Multiple GPU Solution
After every time-step, each process gives its leftmost and rightmost piece of “current” data to neighbor processes!
(diagram: Proc 0 … Proc 4 side by side, exchanging edge values)

Multiple GPU Solution
Pieces of data to communicate:
(diagram: the edge values of each process’s region, Proc 0 … Proc 4)

Multiple GPU Solution
(More details next lecture)
General idea – suppose we’re on GPU r in 0…(N−1):
– If we’re not GPU N−1:
    Send data to process r+1
    Receive data from process r+1
– If we’re not GPU 0:
    Send data to process r−1
    Receive data from process r−1
– Wait on requests
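
A hedged sketch of that exchange with non-blocking MPI calls (more on MPI next lecture); the function name and buffer names are illustrative, and the edge values are assumed to have already been copied from the GPU into host buffers:

    #include <mpi.h>

    // One halo exchange per process: send edge values to neighbors, then wait.
    void exchangeHalos(float *h_send_left, float *h_send_right,
                       float *h_recv_left, float *h_recv_right,
                       int haloSize, int rank, int numProcs) {
        MPI_Request reqs[4];
        int nReqs = 0;

        if (rank != numProcs - 1) {   // talk to the right neighbor (rank + 1)
            MPI_Isend(h_send_right, haloSize, MPI_FLOAT, rank + 1, 0,
                      MPI_COMM_WORLD, &reqs[nReqs++]);
            MPI_Irecv(h_recv_right, haloSize, MPI_FLOAT, rank + 1, 1,
                      MPI_COMM_WORLD, &reqs[nReqs++]);
        }
        if (rank != 0) {              // talk to the left neighbor (rank - 1)
            MPI_Isend(h_send_left, haloSize, MPI_FLOAT, rank - 1, 1,
                      MPI_COMM_WORLD, &reqs[nReqs++]);
            MPI_Irecv(h_recv_left, haloSize, MPI_FLOAT, rank - 1, 0,
                      MPI_COMM_WORLD, &reqs[nReqs++]);
        }
        MPI_Waitall(nReqs, reqs, MPI_STATUSES_IGNORE);
    }

With the redundant regions described below, haloSize would equal the communication interval m and this exchange only needs to happen every m timesteps.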

Multiple GPU Solution
Communication can be expensive!
– Expensive to communicate every timestep to send 1 value!
– Better solution: send some m values every m timesteps!

Possible Implementation
Initial setup: (assume 3 processes)
(diagram: the data array split across Proc 0, Proc 1, Proc 2)

Possible Implementation
Give each array “redundant regions” (assume communication interval = 3)
(diagram: each process’s region of the array extended by a redundant region on each side)
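
A small sketch of how the padded local buffers might be sized under these assumptions (an even split of the global array, halo width equal to the communication interval m; all names are illustrative):

    #include <cuda_runtime.h>

    // Per-process layout: [ m-cell left halo | local interior | m-cell right halo ],
    // replicated three times for the old | current | new regions.
    float *allocLocalRegions(int globalNodes, int numProcs, int m, int *paddedNodes) {
        int localNodes = globalNodes / numProcs;   // interior cells this process owns
        *paddedNodes = localNodes + 2 * m;         // plus both redundant regions
        float *d_data = NULL;
        size_t bytes = 3 * (size_t)(*paddedNodes) * sizeof(float);
        cudaMalloc(&d_data, bytes);                // old | current | new, contiguous
        cudaMemset(d_data, 0, bytes);
        return d_data;
    }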

Possible Implementation Every (3) timesteps, send some of your data to neighbor processes!

Possible Implementation
Send “current” data (current at time of communication)
(diagram: Proc 0, Proc 1, Proc 2 exchanging edge values of the current region)

Possible Implementation
Then send “old” data
(diagram: Proc 0, Proc 1, Proc 2 exchanging edge values of the old region)

Then…
– Do our calculation as normal, if we’re not at the ends of our array
    (“our array” = our entire local array, including the redundant regions!)

What about corruption?
Suppose we’ve just copied our data… (assume a non-boundary process)
– . = valid
– ? = garbage
– ~ = doesn’t matter
(Recall that there exist only 3 spaces – old, current, new – so the gray areas are nonexistent at the current time)

What about corruption? Calculate new data… – Value unknown!

What about corruption? Time t+1: – Current -> old, new -> current (and space for old is overwritten by new…)

What about corruption? More garbage data! – “Garbage in, garbage out!”

What about corruption? Time t+2…

What about corruption? Even more garbage!

What about corruption? Time t+3… – Core data region - corruption imminent!?

What about corruption? Saved! – Data exchange occurs after the communication interval has passed!

Boundary corruption? Examine the left-most process: – We never copy to it, so its left redundant region is garbage! (B = boundary condition set)

Boundary corruption? Calculation brings garbage into non-redundant region!

Boundary corruption? …but boundary condition is set at every interval!