High-Performance Grid Computing and Research Networking. Presented by Yuming Zhang. Instructor: S. Masoud Sadjadi.


1 High-Performance Grid Computing and Research Networking
Classic Examples of Shared Memory Programs
Presented by Yuming Zhang
Instructor: S. Masoud Sadjadi (sadjadi At cs Dot fiu Dot edu)
http://www.cs.fiu.edu/~sadjadi/Teaching/

2 Acknowledgements
The content of many of the slides in these lecture notes has been adapted from online resources prepared previously by the people listed below. Many thanks!
Henri Casanova, Principles of High Performance Computing, http://navet.ics.hawaii.edu/~casanova, henric@hawaii.edu

3 Domain Decomposition
Now that we know how to create and manage threads, we need to decide which thread does what. This is really the art of parallel computing. Fortunately, in shared memory it is often quite simple.
We'll look at three examples:
- "Embarrassingly" parallel application: load-balancing issue
- "Non-embarrassingly parallel" application: thread synchronization issue
- Shark & Fish simulation: load-balancing AND thread synchronization issues

4 Embarrassingly Parallel
Embarrassingly parallel applications:
- Consist of a set of elementary computations
- These computations can be done in any order; they are said to be "independent"
- Sometimes referred to as "pleasantly" parallel
Trivial example: compute all values of a function of two variables over a 2-D domain
- function f(x,y)
- domain = (]0,10], ]0,10])
- domain resolution = 0.001
- number of points = (10 / 0.001)^2 = 10^8
- number of processors and of threads = 4
- each thread performs 25x10^6 function evaluations
- no need for critical sections, no shared output
A minimal sketch of this decomposition appears below.
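
A minimal sketch of the decomposition, assuming an arbitrary example function f(x, y) = sin(x) * cos(y) (the slide does not specify f), with the 10^8 evaluations split statically among 4 OpenMP threads:

    #include <math.h>
    #include <omp.h>
    #include <stdio.h>

    /* Hypothetical example function; the slide does not specify f. */
    static double f(double x, double y) { return sin(x) * cos(y); }

    int main(void) {
        const double h = 0.001;               /* domain resolution */
        const long   n = (long)(10.0 / h);    /* 10,000 points per dimension */
        double sum = 0.0;                     /* just to keep the results observable */

        /* Each (i, j) evaluation is independent: no critical section is needed
           for the evaluations themselves, only for the final reduction. */
        #pragma omp parallel for num_threads(4) reduction(+:sum) collapse(2)
        for (long i = 1; i <= n; i++)
            for (long j = 1; j <= n; j++)
                sum += f(i * h, j * h);

        printf("checksum = %f\n", sum);
        return 0;
    }

With 4 threads and 10^8 total points, each thread ends up with 25x10^6 evaluations, as stated above.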

5 Mandelbrot Set
In many cases, the "cost" of computing f varies with its input. Example: the Mandelbrot set.
For each complex number c, define the series
- Z_0 = 0
- Z_{n+1} = Z_n^2 + c
If the series converges, put a black dot at point c (i.e., if it hasn't diverged after many iterations).
If one partitions the domain into 4 squares among 4 threads, some of the threads will have much more work to do than others.
A sketch of the per-point iteration is shown below.
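
A sketch of the per-point computation, assuming a typical escape-radius test (|Z| > 2) and an illustrative iteration cap, neither of which is specified on the slide:

    /* Returns the number of iterations before divergence, or max_iter if the
       series has not diverged (the point is then drawn black). */
    static int mandel_iterations(double c_re, double c_im, int max_iter) {
        double z_re = 0.0, z_im = 0.0;
        for (int n = 0; n < max_iter; n++) {
            /* Z_{n+1} = Z_n^2 + c, in real/imaginary parts */
            double re = z_re * z_re - z_im * z_im + c_re;
            double im = 2.0 * z_re * z_im + c_im;
            z_re = re;
            z_im = im;
            if (z_re * z_re + z_im * z_im > 4.0)   /* |Z| > 2: the series diverges */
                return n;
        }
        return max_iter;
    }

Points inside the set run the full max_iter iterations while points far outside escape after just a few, so the per-point cost varies widely, which is exactly the load-imbalance problem described above.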

6 Mandelbrot and Load Balancing
The problem with partitioning the domain into 4 identical tiles is that it leads to load imbalance, i.e., suboptimal use of the hardware resources.
Solution: do not partition the domain into as many tiles as threads; instead use many more tiles than threads.
Then have each thread operate as follows:
- compute a tile
- when done, "request" another tile
- until there are no tiles left to compute
This is called a "master-worker" execution (confusing terminology that will make more sense when we do distributed memory programming).

7 Mandelbrot implementation
Conceptually very simple, but how do we write code to do it?
Pthreads:
- Use some shared (protected) counter that keeps track of the next tile; the "keeping track" can be easy or difficult depending on the shape of the tiles
- Threads read and update the counter each time
- When the counter goes over some predefined value, terminate
OpenMP:
- Could be done in the same way
- But OpenMP provides tons of convenient ways to do parallel loops, including "dynamic" scheduling strategies, which do exactly what we need!
- Just write the code as a loop over the tiles, add the proper pragma, and you're done (see the sketch below)
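
A minimal OpenMP sketch of the tile loop. The tile count and the render_tile helper are illustrative assumptions, not part of the slides; the key point is the schedule(dynamic) clause, which hands out tiles to threads as they finish:

    #include <omp.h>

    #define NUM_TILES 256   /* many more tiles than threads */

    /* Hypothetical helper: computes every point of one tile, e.g. by calling
       mandel_iterations() for each pixel the tile covers. */
    void render_tile(int tile);

    void render_all_tiles(void) {
        /* schedule(dynamic): each thread grabs the next unprocessed tile when it
           finishes its current one, which balances the uneven per-tile cost. */
        #pragma omp parallel for schedule(dynamic)
        for (int t = 0; t < NUM_TILES; t++)
            render_tile(t);
    }

With Pthreads, the same effect requires the shared, mutex-protected tile counter described above; the OpenMP runtime maintains that counter for us.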

8 Dependent Computations
In many applications, things are not so simple: elementary computations may not be independent (otherwise parallel computing would be pretty easy).
A common example:
- Consider a (1-D, 2-D, ...) domain that consists of "cells"
- Each cell holds some "state", for example: temperature, pressure, humidity, wind velocity, or an RGB color value
- The application consists of rule(s) that must be applied to update the cell states, possibly over and over in an iterative fashion
- CFD, game of life, image processing, etc.
Such applications are often termed Stencil Applications. We have already talked about one example: Heat Transfer.

9 Dependent Computations
Really simple case:
- Cell values: one floating point number
- Program written with two arrays: f_old and f_new
- One simple loop: f_new[i] = f_old[i] + ...
In more "real" cases, the domain is 2-D (or worse), there are more terms, and the values on the right-hand side can be at time step m+1 as well.
A sketch of the two-array update loop is given below.
Example from: http://ocw.mit.edu/NR/rdonlyres/Nuclear-Engineering/22-00JIntroduction-to-Modeling-and-SimulationSpring2002/55114EA2-9B81-4FD8-90D5-5F64F21D23D0/0/lecture_16.pdf
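
A minimal sketch of the two-array pattern. The slide leaves the right-hand side as "f_old[i] + ..."; here a simple 1-D heat-diffusion-style update is assumed purely for illustration (the coefficient alpha and the neighbor terms are not from the slide):

    #include <string.h>

    #define N 1000

    void stencil_step(const double f_old[N], double f_new[N], double alpha) {
        /* Interior cells: each new value depends on the old value of the cell
           and of its two neighbors, so f_new and f_old must be separate arrays. */
        for (int i = 1; i < N - 1; i++)
            f_new[i] = f_old[i] + alpha * (f_old[i - 1] - 2.0 * f_old[i] + f_old[i + 1]);

        /* Boundary cells kept fixed in this sketch. */
        f_new[0]     = f_old[0];
        f_new[N - 1] = f_old[N - 1];
    }

    void run(double f[N], double alpha, int steps) {
        double tmp[N];
        for (int s = 0; s < steps; s++) {
            stencil_step(f, tmp, alpha);
            memcpy(f, tmp, sizeof(tmp));   /* swap roles of old and new */
        }
    }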

10 Wavefront Pattern
(figure: a cell (i,j) in a 2-D domain that depends on cells (i-1,j), (i,j-1), and (i-1,j-1); example stencil shapes)
Data elements are laid out as multidimensional grids representing a logical plane or space. The dependency between the elements, often formulated by dynamic programming, results in computations known as wavefront computations.

11 The Longest-Common-Subsequence Problem (LCS)
Given two sequences A = <a1, ..., an> and B = <b1, ..., bm>, find the longest sequence that is a subsequence of both A and B.
LCS is a valuable tool for finding information about amino acid sequences in biological genes.
Goal: determine F[n, m], where F[i, j] is the length of the longest common subsequence of the first i elements of A and the first j elements of B.

12 LCS Wavefront
(figure: F[i,j] depends on F[i-1,j], F[i,j-1], and F[i-1,j-1])
The computation starts from F[0,0] and fills out the memoization table diagonally.
A sequential sketch of the table fill is given below.
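
A sketch of the standard LCS dynamic program filling the table F. The recurrence (on a match, diagonal + 1; otherwise the max of the left and upper neighbors) is the textbook formulation; the array layout and the max2 helper are illustrative:

    static int max2(int a, int b) { return a > b ? a : b; }

    /* a has length n, b has length m; F is an (n+1) x (m+1) table.
       F[i][j] = length of the LCS of the first i elements of a
                 and the first j elements of b. */
    void lcs_fill(const char *a, int n, const char *b, int m, int **F) {
        for (int i = 0; i <= n; i++) F[i][0] = 0;
        for (int j = 0; j <= m; j++) F[0][j] = 0;

        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                if (a[i - 1] == b[j - 1])
                    F[i][j] = F[i - 1][j - 1] + 1;            /* match: extend the diagonal */
                else
                    F[i][j] = max2(F[i - 1][j], F[i][j - 1]); /* no match: best neighbor */
            }
        }
        /* F[n][m] is the length of the longest common subsequence. */
    }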

13 One example
Computing the LCS of two amino acid sequences; for the pair shown on the slide, F[n, m] = 5 is the answer.

14 Wavefront computation
How can we parallelize a wavefront computation?
We have seen that the computation consists of computing 2n-1 antidiagonals, in sequence. Computations within each antidiagonal are independent and can be done in a multithreaded fashion.
Algorithm: for each antidiagonal, use multiple threads to compute its elements.
- One may need to use a variable number of threads, because some antidiagonals are very small while others can be large
- Can be implemented with a single array
A sketch of the antidiagonal loop is given below.
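
A sketch of the antidiagonal parallelization for the LCS table above, using OpenMP. The diagonal indexing (cells (i, d-i) lie on antidiagonal d) is one common way to express it, not the one prescribed by the slides:

    #include <omp.h>

    static int max2i(int a, int b) { return a > b ? a : b; }

    void lcs_fill_wavefront(const char *a, int n, const char *b, int m, int **F) {
        for (int i = 0; i <= n; i++) F[i][0] = 0;
        for (int j = 0; j <= m; j++) F[0][j] = 0;

        /* Antidiagonal d contains the cells (i, j) with i + j == d.
           Cells on the same antidiagonal only depend on the two previous
           antidiagonals, so they can be computed concurrently. */
        for (int d = 2; d <= n + m; d++) {
            int lo = d - m > 1 ? d - m : 1;   /* smallest valid i on this diagonal */
            int hi = d - 1 < n ? d - 1 : n;   /* largest valid i on this diagonal */

            #pragma omp parallel for
            for (int i = lo; i <= hi; i++) {
                int j = d - i;
                if (a[i - 1] == b[j - 1])
                    F[i][j] = F[i - 1][j - 1] + 1;
                else
                    F[i][j] = max2i(F[i - 1][j], F[i][j - 1]);
            }
        }
    }

Note that the first and last antidiagonals contain only a few cells, which is the "variable number of threads" issue mentioned above.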

15 Wavefront computation
What about cache efficiency? After all, reading only one element from an antidiagonal at a time is probably not good: the elements are not contiguous in memory!
Solution: blocking, just like matrix multiply.
(figure: the matrix divided into a grid of blocks distributed among threads p0, p1, p2, p3)

16-19 Wavefront computation (blocked, phase by phase)
(figures: the same blocked matrix on threads p0 through p3, shown phase by phase; with a 4x4 grid of blocks the blocked wavefront runs in 7 phases, computing 1, 2, 3, 4, 3, 2, and 1 blocks in phases 1 through 7)

20 Workload Partitioning
First the matrix is divided into parts of adjacent columns, as many parts as there are clusters. Afterwards the part within each cluster is partitioned further. The computation is then performed in the same way.

21 Performance Modeling
One thing we'll need to do often in HPC is building performance models:
- Given simple assumptions regarding the underlying architecture (e.g., ignore cache effects)
- Come up with an analytical formula for the parallel speed-up
Let's try it on this simple application:
- Let N be the (square) matrix size
- Let p be the number of threads/cores, which is fixed

22 Performance Modeling
What if we use p^2 blocks? We assume that p divides N (N > p).
Then the computation proceeds in 2p-1 phases, and each phase lasts as long as the time to compute one block (because of concurrency), T_b. Therefore:
- Parallel time = (2p-1) T_b
- Sequential time = p^2 T_b
- Parallel speedup = p^2 / (2p-1)
- Parallel efficiency = p / (2p-1)
Examples:
- p=2: speedup = 4/3, efficiency = 66%
- p=4: speedup = 16/7, efficiency = 57%
- p=8: speedup = 64/15, efficiency = 53%
Asymptotically: efficiency tends to 50%.

23 Performance Modeling
What if we use (b x p)^2 blocks, with b some integer between 1 and N/p? We assume that p divides N (N > p). Performance modeling becomes more complicated:
- The computation still proceeds in 2bp-1 phases
- But a thread can have more than one block to compute during a phase!
- During phase i, there are i blocks to compute for i = 1, ..., bp, and 2bp-i blocks to compute for i = bp+1, ..., 2bp-1
- If there are x (> 0) blocks to compute in a phase, then the execution time for that phase is floor((x-1)/p) + 1, i.e., ceil(x/p), assuming T_b = 1
Therefore, the parallel execution time is the sum of ceil(x_i/p) over all phases, where x_i is the number of blocks in phase i.

24 Performance Modeling
(the slide shows the resulting formula for the parallel execution time, obtained by summing the per-phase times above)

25 Performance Modeling
Example: N = 1000, p = 4.
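
A small sketch that evaluates this performance model numerically for the example N = 1000, p = 4. It simply sums the per-phase time ceil(x/p) over the 2bp-1 phases for each value of b that tiles the matrix; the set of b values tried and the output format are illustrative:

    #include <stdio.h>

    int main(void) {
        const int N = 1000, p = 4;

        for (int b = 1; b <= N / p; b++) {
            if ((N / p) % b != 0) continue;      /* only block counts that tile the matrix */

            long seq = (long)b * p * b * p;      /* sequential time: (bp)^2 blocks, T_b = 1 */
            long par = 0;

            /* Phase i has x = i blocks for i = 1..bp, then x = 2bp - i blocks
               for i = bp+1 .. 2bp-1; a phase with x blocks takes ceil(x/p). */
            for (int i = 1; i <= 2 * b * p - 1; i++) {
                int x = (i <= b * p) ? i : 2 * b * p - i;
                par += (x + p - 1) / p;          /* ceil(x / p) with integer arithmetic */
            }

            printf("b = %4d  speedup = %.3f\n", b, (double)seq / (double)par);
        }
        return 0;
    }

Running this shows the modeled speedup growing toward p = 4 as b increases, which is the trend discussed on the next slide.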

26 Performance Modeling
When b gets larger, speedup increases and tends to p. Since b <= N/p, the best speed-up is Np / (N + p - 1); when N is large compared to p, this is very close to p. (A short derivation follows below.)
Therefore, use a block size of 1, meaning no blocking! We're back to where we started, because our performance model ignores cache effects!
Trade-off:
- From a parallel efficiency perspective: small block size
- From a cache efficiency perspective: big block size
- Possible rule of thumb: use the biggest block size that fits in the L1 cache (L2 cache?)
Lesson: full performance modeling is difficult. We could add the cache behavior, but think of a dual-core machine with shared L2 cache, etc. In practice: do performance modeling for asymptotic behaviors, and then do experiments to find out what works best.
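
For reference, a short derivation of the best-case speedup quoted above, under the slide's assumptions (T_b = 1, p divides N, b block-antidiagonals per thread per dimension); the intermediate closed form for the parallel time is worked out here rather than taken from the slides:

    T_par(b) = \sum_{i=1}^{2bp-1} \lceil x_i / p \rceil
             = 2 \sum_{i=1}^{bp} \lceil i/p \rceil - \lceil bp/p \rceil
             = 2 \cdot p \, \frac{b(b+1)}{2} - b
             = b (bp + p - 1)

    S(b) = \frac{T_seq}{T_par(b)} = \frac{(bp)^2}{b (bp + p - 1)} = \frac{b p^2}{bp + p - 1}

    With b = N/p (block size 1):  S = \frac{(N/p) p^2}{N + p - 1} = \frac{Np}{N + p - 1}

S(b) is increasing in b and tends to p, which is why, cache effects aside, the model favors the smallest possible blocks.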

27 Sharks and Fish
Simulation of a population of prey and predators. Each entity follows some behavior:
- Prey move and breed
- Predators move, hunt, and breed
Given initial populations and the nature of the entity behaviors (e.g., probability of breeding, probability of successful hunting), what do the populations look like after some time?
This is something computational ecologists do all the time to study ecosystems.

28 Sharks and Fish
There are several possibilities for implementing such a simulation. A simple one is to do something that looks like "the game of life":
- A 2-D domain with NxN cells (each cell can be described by many environmental parameters)
- Each cell in the domain can hold a shark or a fish
- The simulation is iterative
- There are several rules for movement, breeding, preying
Why do it in parallel? Many entities, and entity interactions may be complex.
How can one write this in parallel with threads and shared memory?

29-30 Space partitioning
One solution is to divide the 2-D domain between threads; each thread deals with the entities in its domain.
(figure: the domain split into 4 regions, one per thread, for 4 threads)

31-32 Move conflict?
Threads can make decisions that will lead to conflicts!

33 Dealing with conflicts
Concept of shadow cells: only entities in the red (boundary) regions may cause a conflict.
One possible implementation:
- Each thread deals with its green (interior) region
- Thread 1 deals with its red region
- Thread 2 deals with its red region
- Thread 3 deals with its red region
- Thread 4 deals with its red region
- Repeat
This will still prevent some types of moves (no swapping of locations). The implementer must make choices.
A sketch of this phased update is given below.
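
A minimal sketch of that schedule using OpenMP barriers, assuming 4 threads and hypothetical helpers update_green_region() and update_red_region() (the actual data structures and movement rules are not given on the slides):

    #include <omp.h>

    /* Hypothetical helpers: update all entities in the given thread's
       interior (green) or boundary (red) sub-domain for one time step. */
    void update_green_region(int thread_id);
    void update_red_region(int thread_id);

    void simulate_one_step(void) {
        #pragma omp parallel num_threads(4)
        {
            int id = omp_get_thread_num();

            /* Phase 1: every thread updates its green region concurrently;
               these cells cannot conflict with other threads' cells. */
            update_green_region(id);
            #pragma omp barrier

            /* Phases 2-5: red regions are updated one thread at a time,
               so boundary moves never race with a neighboring thread. */
            for (int turn = 0; turn < 4; turn++) {
                if (id == turn)
                    update_red_region(id);
                #pragma omp barrier
            }
        }
    }

Serializing the red regions trades some parallelism for simplicity; finer-grained schemes (e.g., locking individual boundary cells) are possible but, as the next slide notes, come with their own overheads.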

34 Load Balancing
What if all the fish end up in the same region (because they move, because they breed)? Then one thread has much more work to do than the others.
Solution: dynamic repartitioning, i.e., modify the partitioning so that the load is balanced.
But perhaps one good idea would be to not do domain partitioning at all! How about doing entity partitioning instead? Better load balancing, but more difficult to deal with conflicts; one may use locks, but with high overhead.

35 Conclusion
Main lessons:
- There are many classes of applications, with many domain partitioning schemes
- Performance modeling is fun but inherently limited
- It's all about trade-offs: overhead vs. load balancing, parallelism vs. cache usage, etc.
Remember, this is the easy side of parallel computing. Things will become much more complex in distributed memory programming.


