
1 CS4402 – Parallel Computing. Lecture 7: Parallel Graphics – More Fractals; Scheduling

2 FRACTALS

3 Fractals
A fractal is a set of points such that:
- it has infinite detail at every point (its fractal dimension exceeds its topological dimension);
- it is self-similar: any part of the fractal is similar to the whole fractal.
Generating a fractal is an iterative process:
- start from P_0;
- iteratively generate P_1 = F(P_0), P_2 = F(P_1), …, P_n = F(P_{n-1}), …
P_0 is a set of initial points.
F is a transformation: geometric (translations, rotations, scaling, …) or a non-linear coordinate transformation.

4 Points vs Pixels
We work with two rectangular areas.
The user space:
- real coordinates (x, y);
- bounded by [xMin, xMax] × [yMin, yMax].
The screen space:
- integer coordinates (i, j);
- bounded by [0, w-1] × [0, h-1];
- upside down, with the Oy axis pointing downwards.
How do we squeeze the user space into the screen space? How do we translate (x, y) into (i, j)?
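One common answer is a linear map that rescales the coordinates and, because the screen's Oy axis points downwards, flips y. A minimal sketch, not the slides' code: the later fragments use the special case of a single STEP and do not flip the axis.

/* user space [xMin,xMax] x [yMin,yMax]  <->  screen space [0,w-1] x [0,h-1] */
double stepX = (xMax - xMin) / (w - 1);
double stepY = (yMax - yMin) / (h - 1);

/* pixel (i, j) -> user point (x, y); row j = 0 is the top of the screen */
double x = xMin + i * stepX;
double y = yMax - j * stepY;

/* user point (x, y) -> pixel (pi, pj) */
int pi = (int)((x - xMin) / stepX + 0.5);
int pj = (int)((yMax - y) / stepY + 0.5);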

5 Julia Sets – Self-Squaring Fractals
Consider the generating function F(z) = z^2 + c, with z, c ∈ C.
Sequence of complex numbers: z_0 ∈ C and z_{n+1} = z_n^2 + c.
The behaviour is chaotic, but |z_n| has two attractors: 0 and +∞.
For a given c ∈ C, the Julia set J_c consists of all the points z_0 whose orbit remains bounded.
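The code on the next slides calls func (one application of F) and CompAbs (the modulus) on a complex type with fields re and im, but never shows them. A minimal sketch consistent with those names; the definitions themselves are an assumption.

#include <math.h>

typedef struct { double re, im; } Complex;

/* one application of the generating function: F(z) = z*z + c */
Complex func(Complex z, Complex c) {
    Complex w;
    w.re = z.re * z.re - z.im * z.im + c.re;
    w.im = 2.0 * z.re * z.im + c.im;
    return w;
}

/* modulus |z|, compared against the threshold R in the escape test */
double CompAbs(Complex z) {
    return sqrt(z.re * z.re + z.im * z.im);
}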

6 Julia Sets – Algorithm
Inputs: c ∈ C the complex number; [x_min, x_max] × [y_min, y_max] a region in the plane; N_iter the number of iterations for the orbits; R a threshold for the attractor +∞.
Output: J_c, the Julia set of c.
Algorithm:
For each pixel (i, j) on the screen
    translate (i, j) into (x, y)
    construct z_0 = x + y·i (a complex number)
    find the orbit of z_0 [first N_iter elements]
    if (all the orbit points are under the threshold) draw (x, y)

7
for (i = 0; i < width; i++)
  for (j = 0; j < width; j++) {
    int k = 0;
    // construct z_0 for this pixel and follow its orbit
    z.re = XMIN + i*STEP;
    z.im = YMIN + j*STEP;
    for (k = 0; k < NUMITER; k++) {
      z = func(z, c);                // one step: z = z^2 + c
      if (CompAbs(z) > R) break;     // the orbit escapes towards infinity
    }
    // test whether the orbit stayed bounded for all NUMITER steps
    if (k > NUMITER-1) {
      MPE_Draw_point(graph, i, j, MPE_YELLOW);
    } else {
      MPE_Draw_point(graph, i, j, MPE_RED);
    }
    MPE_Update(graph);
  }

8 Julia Sets – || Algorithm
Remark 1.
- The double for loop on (i, j) can be split across processors, e.g.
  - uniform block or cyclic partitioning on i;
  - uniform block or cyclic partitioning on j.
- There is no communication at all between processors, therefore this is an embarrassingly parallel computation.
Remark 2.
- Each processor draws its own block of the fractal, or several rows, on the XGraph.
- Processor P_rank knows which area it has to draw.

9
// block partitioning on i (the commented variants give cyclic on i,
// block on j and cyclic on j respectively)
for (i = rank*width/size; i < (rank+1)*width/size; i++)
  for (j = 0; j < width; j++) {
// for (i = rank; i < width; i += size)  for (j = 0; j < width; j++) {
// for (i = 0; i < width; i++)  for (j = rank*width/size; j < (rank+1)*width/size; j++) {
// for (i = 0; i < width; i++)  for (j = rank; j < width; j += size) {
    int k = 0;
    // construct z_0 for this pixel and follow its orbit
    z.re = XMIN + i*STEP;
    z.im = YMIN + j*STEP;
    for (k = 0; k < NUMITER; k++) {
      z = func(z, c);
      if (CompAbs(z) > R) break;
    }
    // test whether the orbit stayed bounded
    if (k > NUMITER-1) {
      MPE_Draw_point(graph, i, j, MPE_YELLOW);
    } else {
      MPE_Draw_point(graph, i, j, MPE_RED);
    }
    MPE_Update(graph);
  }


12 The Mandelbrot Set
THE MANDELBROT FRACTAL IS AN INDEX FOR JULIA FRACTALS.
The Mandelbrot set contains all the points c ∈ C such that the sequence z_0 = 0, z_{n+1} = z_n^2 + c has a bounded orbit.
Inputs: [x_min, x_max] × [y_min, y_max] a region in the plane; N_iter the number of iterations for the orbits; R a threshold for the attractor +∞.
Output: M, the Mandelbrot set.
Algorithm:
For each (x, y) in [x_min, x_max] × [y_min, y_max]
    c = x + y·i
    find the orbit of z_0 = 0 while it stays under the threshold
    if (all the orbit points are under the threshold) draw c = (x, y)

13
for (i = 0; i < width; i++)
  for (j = 0; j < width; j++) {
    int k = 0;
    // construct the point c for this pixel
    c.re = XMIN + i*STEP;
    c.im = YMIN + j*STEP;
    // construct the orbit of z_0 = 0
    z.re = z.im = 0;
    for (k = 0; k < NUMITER; k++) {
      z = func(z, c);
      if (CompAbs(z) > R) break;
    }
    // test whether the orbit stayed bounded
    if (k > NUMITER-1) {
      MPE_Draw_point(graph, i, j, MPE_YELLOW);
    } else {
      MPE_Draw_point(graph, i, j, MPE_RED);
    }
    MPE_Update(graph);
  }

14 The Mandelbrot Set – || Algorithm
Remark 1.
- The double for loop on (i, j) can be split across processors, e.g.
  - uniform block or cyclic partitioning on i;
  - uniform block or cyclic partitioning on j.
- There is no communication at all between processors, therefore this is an embarrassingly parallel computation.
Remark 2.
- When the orbit escapes to infinity after k steps, we can draw the pixel (i, j) with the k-th colour from a palette (see the sketch below).
- Bands with the same colour contain points with the same escape behaviour.
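A minimal sketch of this colouring idea, assuming a palette array colors[] with at least NUMITER+1 MPE_Color values has been prepared beforehand; the palette is an assumption, the other names follow the earlier fragments.

/* k counts how many iterations the orbit survived before |z| exceeded R;
   k == NUMITER means the orbit never escaped (the point is in the set). */
for (k = 0; k < NUMITER; k++) {
    z = func(z, c);
    if (CompAbs(z) > R) break;
}
MPE_Draw_point(graph, i, j, colors[k]);   /* same colour => same escape band */
MPE_Update(graph);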


16 Fractals and Prime Numbers
Prime numbers can generate fractals.
Remarks:
- If p > 5 is prime then p % 5 is 1, 2, 3 or 4.
- 1, 2, 3, 4 represent directions to move, e.g. left, right, up, down.
- The fractal image has sizes w and h.
Step 1. Initialise a colour matrix with 0.
Step 2. For each number p > 5:
    if p is prime then
        if (p%5==1) x = (x-1) % w;
        if (p%5==2) x = (x+1) % w;
        if (p%5==3) y = (y-1) % h;
        if (p%5==4) y = (y+1) % h;
        increment the colour of (x, y)
Step 3. Draw the pixels using the colour matrix.

17 Simple Remarks
The set of prime numbers is infinite and, moreover, it has no pattern.
prime: 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, …
move:  3, 0, 2, 1, 3, 2, 4, 3, 4, 1, 2, …
The set of moves satisfies:
- it has no pattern, so the moves are quite random;
- the numbers of 1, 2, 3 and 4 moves are quite similar, hence the central pixels are reached more often.
The computation of the for loop (dominated by the primality tests) is the most expensive operation.

18
// initialise the matrix with 0
for (i = 0; i < width; i++) for (j = 0; j < width; j++) map[i][j] = 0;

// start from the image centre
posX = posY = width/2;

// traverse the odd numbers 1, 3, 5, ... and keep the primes
for (i = 0; i < n; i++) {
  if (isPrime(2*i+1)) {
    // move to a new position on the map and increment it
    move = (2*i+1) % 5;
    if (move==1) posX = (posX-1+width) % width;   // +width avoids a negative index
    if (move==2) posX = (posX+1) % width;
    if (move==3) posY = (posY-1+width) % width;
    if (move==4) posY = (posY+1) % width;
    map[posY][posX]++;
  }
}
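The loop relies on an isPrime helper that the slides never show; a simple trial-division sketch (an assumption, not the module's implementation) is:

/* trial-division primality test; this is the dominant cost of the loop */
int isPrime(long p) {
    if (p < 2) return 0;
    if (p < 4) return 1;            /* 2 and 3 */
    if (p % 2 == 0) return 0;
    for (long d = 3; d * d <= p; d += 2)
        if (p % d == 0) return 0;
    return 1;
}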

19 Parallel Computation: Simple Remarks
Processor rank gets a share of the numbers to test, using some partitioning.
Processor rank therefore traverses the pixels according to its own sequence of moves.
Processor rank has to work with its own matrix map.
The maps must be reduced on processor 0 to find the total number of hits per pixel.

20 Parallel Computation: Simple Remarks
The parallel computation of processor rank follows these steps:
1. Initialise the matrix map.
2. For each prime number assigned to rank:
   a. find the move and go to the new location;
   b. increment the map.
3. Reduce the matrix map.
4. If processor 0, then draw the map.

21 Splitting Loops
How do we split the sequential loop if we have size processors?
Maths: n iterations and size processors ⇒ n/size iterations per processor.

for (i = 0; i < n; i++) {
  // body of the loop
  loop_body(data, i);
}

22 Splitting Loops in Similar Blocks
P_rank gets the iterations rank*n/size, rank*n/size + 1, …, (rank+1)*n/size - 1.

for (i = rank*n/size; i < (rank+1)*n/size; i++) {
  // acquire the data for this iteration
  loop_body(data, i);
}

23 Splitting Loops in Cycles
P_rank gets the iterations rank, rank+size, rank+2*size, ….

for (i = rank; i < n; i += size) {
  // acquire the data for this iteration
  loop_body(data, i);
}

24 Splitting Loops in Variable Blocks
P_rank gets the iterations l[rank], l[rank]+1, …, u[rank].

for (i = l[rank]; i <= u[rank]; i++) {
  // acquire the data for this iteration
  loop_body(data, i);
}

25
// initialise the local matrix with 0
for (i = 0; i < width; i++) for (j = 0; j < width; j++) map[i][j] = 0;

// start from the image centre
posX = posY = width/2;

// traverse this processor's block of candidate numbers
for (i = rank*n/size; i < (rank+1)*n/size; i++) {
  if (isPrime(p = 2*i+1)) {
    // move to a new position on the map and increment it
    move = p % 5;
    if (move==1) posX = (posX-1+width) % width;
    if (move==2) posX = (posX+1) % width;
    if (move==3) posY = (posY-1+width) % width;
    if (move==4) posY = (posY+1) % width;
    map[posY][posX]++;
  }
}

// combine the local maps on processor 0
MPI_Reduce(&map[0][0], &globalMap[0][0], width*width, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

if (rank == 0) {
  for (i = 0; i < width; i++) for (j = 0; j < width; j++)
    MPE_Draw_point(graph, i, j, colors[globalMap[i][j]]);
}


27 Scheduling

28 Parallel Loops
Parallel loops represent the main source of parallelism.
Consider a system with p processors P_1, P_2, …, P_p and the loop

for i = 1, n do
    call loop_body(i)
end for

Scheduling Problem: map the iterations {1, 2, …, n} onto the processors so that:
- the execution time is minimal;
- the execution times of the processors are balanced;
- the processors' idle time is minimal.

29 Parallel Loops
Suppose that the workload of loop_body is known and given by w_1, w_2, …, w_n.
Processor P_J receives the set of iterations S_J = {i_1, i_2, …, i_k}, so:
- the execution time of processor P_J is T(P_J) = ∑ { w_i : i ∈ S_J };
- the execution time of the parallel loop is T = max { T(P_J) : J = 1, 2, …, p }.
Static scheduling: the partition is found at compile time.
Dynamic scheduling: the partition is found at run time.
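As a small worked example (the numbers are chosen here, not taken from the slides), the makespan T follows directly from these definitions:

#include <stdio.h>

int main(void) {
    /* workloads w_1..w_8; a block partition over p = 2 processors gives
       S_1 = {1..4}, S_2 = {5..8} */
    double w[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    double T1 = 0, T2 = 0;
    for (int i = 0; i < 4; i++) T1 += w[i];   /* T(P_1) = 10 */
    for (int i = 4; i < 8; i++) T2 += w[i];   /* T(P_2) = 26 */
    double T = (T1 > T2) ? T1 : T2;           /* T = 26: badly imbalanced */
    printf("block: T(P_1)=%g T(P_2)=%g T=%g\n", T1, T2, T);
    /* a cyclic partition (odd vs even iterations) gives 16 and 20 instead */
    return 0;
}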

30 Data Dependency
A dependency exists between program statements when the order of statement execution affects the results of the program.
A data dependency results from multiple uses of the same location(s) in storage by different tasks: one piece of data is the "input" for another.
Dependencies are important to parallel programming because they are one of the primary inhibitors to parallelism.
Loops with data dependencies cannot be scheduled as independent iterations.
Example: the following for loop contains data dependencies.

for i = 1, n do
    a[i] = a[i-1] + 1
end for

31 Load Balancing
Load balancing refers to the practice of distributing work among processors so that all processors are kept busy all of the time.
If all the processors' execution times are the same, then a perfect load balance is achieved.
Load imbalance is the most important overhead of parallel computation; it reflects the case when two processors' execution times differ.


34 Useful Rules
- If the workloads are similar, then use static uniform block scheduling.
- If the workloads increase/decrease, then use static cyclic scheduling.
- If we know the workloads and they are simple, then use them to guide the load balance (balanced block scheduling, next slides).
- If the workloads are not known, then use dynamic methods (see the sketch below).
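The dynamic case is not developed further on these slides; one common scheme is a master–worker (self-scheduling) loop in which processor 0 hands out one iteration at a time on demand. A minimal MPI sketch under that assumption, reusing the loop_body(data, i) convention from the earlier slides:

#include <mpi.h>

#define TAG_WORK 1
#define TAG_STOP 2

void loop_body(void *data, int i);   /* assumed to be defined elsewhere */

/* Self-scheduling loop: rank 0 acts as master and hands out iteration
   indices on demand; the other ranks request a new index whenever they
   finish the current one. */
void dynamic_loop(int n, int rank, int size, void *data) {
    MPI_Status status;
    if (rank == 0) {
        int next = 0, stopped = 0, dummy;
        while (stopped < size - 1) {
            /* any worker asking for work sends a (content-free) request */
            MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            if (next < n) {
                MPI_Send(&next, 1, MPI_INT, status.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 1, MPI_INT, status.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
                stopped++;
            }
        }
    } else {
        int i, request = 0;
        for (;;) {
            MPI_Send(&request, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
            MPI_Recv(&i, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
            if (status.MPI_TAG == TAG_STOP) break;
            loop_body(data, i);
        }
    }
}

Handing out small chunks of iterations instead of single indices reduces the messaging overhead; note that in this sketch rank 0 does no loop_body work itself.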

35 Balanced Workload Block Scheduling
Let w_1, w_2, …, w_n be the workloads of the iterations:
- the total workload is W = w_1 + w_2 + … + w_n;
- the average workload per processor is W / p.
Each processor gets consecutive iterations:
- l_rank and u_rank are the lower and upper indices of its block;
- the workload of processor rank is w_{l_rank} + w_{l_rank+1} + … + w_{u_rank}, which should be approximately W / p.

36 Balanced Workload Block Scheduling
It is simpler to work with integrals: approximate the workload of iteration i by a function w(x).
The average workload per processor is (1/p) ∫_0^n w(x) dx.
Each processor's workload is ∫_{l_rank}^{u_rank} w(x) dx, and the block boundaries l_rank, u_rank are chosen so that this is approximately equal to the average (a code sketch follows).
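A minimal sketch of how the block boundaries could be computed from known (or estimated) per-iteration workloads; the helper name and the greedy strategy are assumptions, not the slides' method.

/* Choose l[r], u[r] so that each block's workload is roughly total/p. */
void balanced_blocks(const double *w, int n, int p, int *l, int *u) {
    double total = 0.0;
    for (int i = 0; i < n; i++) total += w[i];
    double target = total / p;               /* average workload per processor */

    int i = 0;
    for (int r = 0; r < p; r++) {
        l[r] = i;
        double sum = 0.0;
        /* add iterations until the block reaches the average, but leave at
           least one iteration for each of the remaining processors */
        while (i < n - (p - 1 - r) && (i == l[r] || sum < target)) {
            sum += w[i];
            i++;
        }
        u[r] = i - 1;
    }
    u[p - 1] = n - 1;                        /* last block absorbs the rest */
}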


42 Granularity
Granularity is the ratio of computation to communication. Periods of computation are typically separated from periods of communication by synchronization events.
Fine-grain parallelism:
- relatively small amounts of computational work are done between communication events;
- facilitates load balancing;
- implies high communication overhead and less opportunity for performance enhancement.
Coarse-grain parallelism:
- relatively large amounts of computational work are done between communication/synchronization events;
- harder to load balance efficiently.