Paraguin Compiler Examples

Examples

- Matrix Addition (the complete program)
- Traveling Salesman Problem (TSP)
- Sobel Edge Detection

Matrix Addition: the complete program

Matrix Addition (complete)

#define N 512

#ifdef PARAGUIN
typedef void* __builtin_va_list;
extern int MPI_COMM_WORLD;
extern int MPI_Barrier();
#endif

#include <stdio.h>
#include <math.h>
#include <sys/time.h>

void print_results(char *prompt, float a[N][N]);

int main(int argc, char *argv[])
{
    int i, j;
    float a[N][N], b[N][N], c[N][N];
    char *usage = "Usage: %s file\n";
    FILE *fd;

Matrix Addition (complete)

    double elapsed_time;
    struct timeval tv1, tv2;

    if (argc < 2) {
        fprintf (stderr, usage, argv[0]);
        return -1;
    }

    if ((fd = fopen (argv[1], "r")) == NULL) {
        fprintf (stderr, "%s: Cannot open file %s for reading.\n",
                 argv[0], argv[1]);
        return -1;
    }

Matrix Addition (complete)

    // Read input from file for matrices a and b.
    // The I/O is not timed because this I/O needs
    // to be done regardless of whether this program
    // is run sequentially on one processor or in
    // parallel on many processors. Therefore, it is
    // irrelevant when considering speedup.
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            fscanf (fd, "%f", &a[i][j]);
            fscanf (fd, "%f", &b[i][j]);
        }

Matrix Addition (complete)

#ifdef PARAGUIN
    ;
#pragma paraguin begin_parallel

    // This barrier is here so that we can take a time stamp
    // once we know all processes are ready to go.
    MPI_Barrier(MPI_COMM_WORLD);

#pragma paraguin end_parallel
#endif

    // Take a time stamp
    gettimeofday(&tv1, NULL);

    // Broadcast the input to all processors. This could be
    // faster if we used scatter, but Bcast is easy and scatter
    // is not implemented in Paraguin.
#pragma paraguin bcast a b

Matrix Addition (complete)

    // Parallelize the following loop nest assigning iterations
    // of the outermost loop (i) to different partitions.
#pragma paraguin forall C p i j \
        0x0 -1 1 0x0 \
        0x0 1 -1 0x0

    // We need to gather all values c[i][j]. So we can just
    // use i,j => 0.
#pragma paraguin gather 0x0 C i j \
        0x0 0x0 0x0

    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            c[i][j] = a[i][j] + b[i][j];
        }
    }

Matrix Addition (complete)

    ;
#pragma paraguin end_parallel

    // Take a time stamp. This won't happen until after the master
    // process has gathered all the input from the other processes.
    gettimeofday(&tv2, NULL);

    elapsed_time = (tv2.tv_sec - tv1.tv_sec) +
                   ((tv2.tv_usec - tv1.tv_usec) / 1000000.0);

    printf ("elapsed_time=\t%lf (seconds)\n", elapsed_time);

    // print result
    print_results("C = ", c);
}

Matrix Addition (complete)

void print_results(char *prompt, float a[N][N])
{
    int i, j;

    printf ("\n\n%s\n", prompt);
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            printf(" %.2f", a[i][j]);
        }
        printf ("\n");
    }
    printf ("\n\n");
}
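For comparison, here is a sketch of what an equivalent hand-written MPI program might look like. This is not the code Paraguin generates; it assumes a block partition of rows, assumes the number of processes evenly divides N, and initializes the matrices with random values instead of file input.

/* Hand-written MPI analogue of the Paraguin program above (a sketch,
 * under the assumptions stated in the lead-in). */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define N 512

static float a[N][N], b[N][N], c[N][N];

int main(int argc, char *argv[])
{
    int i, j, rank, nprocs, rows;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    rows = N / nprocs;                  /* assumes N % nprocs == 0 */

    if (rank == 0)                      /* master initializes the input */
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++) {
                a[i][j] = (float) rand() / RAND_MAX;
                b[i][j] = (float) rand() / RAND_MAX;
            }

    /* Broadcast the input, as the bcast pragma does */
    MPI_Bcast(a, N*N, MPI_FLOAT, 0, MPI_COMM_WORLD);
    MPI_Bcast(b, N*N, MPI_FLOAT, 0, MPI_COMM_WORLD);

    /* Each process adds its own block of rows */
    for (i = rank*rows; i < (rank+1)*rows; i++)
        for (j = 0; j < N; j++)
            c[i][j] = a[i][j] + b[i][j];

    /* Gather the row blocks back on the master */
    if (rank == 0)
        MPI_Gather(MPI_IN_PLACE, rows*N, MPI_FLOAT,
                   c, rows*N, MPI_FLOAT, 0, MPI_COMM_WORLD);
    else
        MPI_Gather(&c[rank*rows][0], rows*N, MPI_FLOAT,
                   NULL, 0, MPI_FLOAT, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("c[N-1][N-1] = %f\n", c[N-1][N-1]);

    MPI_Finalize();
    return 0;
}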

Matrix Addition

Compiling with the command:

    runparaguin matrixadd.c

produces:

    matrixadd.out.c (source with MPI)
    matrixadd.out (compiled with mpicc)

(Demonstration)
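Running the demonstration requires an input file holding the two N x N matrices. The helper below is not part of the example; it is a minimal sketch that writes random values in the interleaved order implied by the fscanf loop above (one value for a, then one for b, per element).

/* gen_input.c: hypothetical generator of an input file for matrixadd.out. */
#include <stdio.h>
#include <stdlib.h>

#define N 512

int main(int argc, char *argv[])
{
    int i, j;
    FILE *fd;

    if (argc < 2 || (fd = fopen(argv[1], "w")) == NULL) {
        fprintf(stderr, "Usage: %s file\n", argv[0]);
        return -1;
    }
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            fprintf(fd, "%f %f\n",
                    (float) rand() / RAND_MAX,    /* a[i][j] */
                    (float) rand() / RAND_MAX);   /* b[i][j] */
    fclose(fd);
    return 0;
}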

Partitioning Reviewed

#pragma paraguin forall C p i j \
    0x0 -1 1 0x0 \
    0x0 1 -1 0x0

The expression above assigns each iteration of the i loop to its own partition (p = i). We could also partition along the j loop:

    0x0 -1 0x0 1 \
    0x0 1 0x0 -1

or we could choose many other partitionings.

Partitioning Reviewed

The partitioning is a system of inequalities written in matrix/vector form:

\[ \vec{0} \le A\,\vec{x} \]

where \(A\) is the coefficient matrix given in the pragma, and \(\vec{0}\) and \(\vec{x}\) are vectors, with \(\vec{x} = (1, p, i, j)^T\): the constant 1 followed by the partition variable and the loop indices.

Partitioning Reviewed

So the partition expressed in the pragma:

#pragma paraguin forall C p i j \
    0x0 -1 1 0x0 \
    0x0 1 -1 0x0

represents the following:

\[ \vec{0} \le \begin{pmatrix} 0 & -1 & 1 & 0 \\ 0 & 1 & -1 & 0 \end{pmatrix} \begin{pmatrix} 1 \\ p \\ i \\ j \end{pmatrix} \]

Partitioning Reviewed

If we multiply this out, we get:

\[ 0 \le -p + i \qquad 0 \le p - i \]

Partitioning Reviewed

Now simplify:

\[ p \le i \qquad p \ge i \quad\Longrightarrow\quad p = i \]

Partitioning Reviewed

#pragma paraguin forall C p i j \
    0x0 -1 1 0x0 \
    0x0 1 -1 0x0

[Figure: the (i, j) iteration space; each value of i forms its own partition p = i, shown for p = 0 through p = 11.]

Partitioning Reviewed

So the partition expressed in the pragma:

#pragma paraguin forall C p i j \
    0x0 -1 0x0 1 \
    0x0 1 0x0 -1

represents the following:

\[ \vec{0} \le \begin{pmatrix} 0 & -1 & 0 & 1 \\ 0 & 1 & 0 & -1 \end{pmatrix} \begin{pmatrix} 1 \\ p \\ i \\ j \end{pmatrix} \]

Partitioning Reviewed

If we multiply this out, we get:

\[ 0 \le -p + j \qquad 0 \le p - j \]

Partitioning Reviewed

Now simplify:

\[ p \le j \qquad p \ge j \quad\Longrightarrow\quad p = j \]

Partitioning Reviewed

#pragma paraguin forall C p i j \
    0x0 -1 0x0 1 \
    0x0 1 0x0 -1

[Figure: the (i, j) iteration space; each value of j forms its own partition p = j, shown for p = 0 through p = 11.]

Partitioning Reviewed

Let's say we want to partition using p = i + j. We actually have to go the other direction: start from the desired equality and derive the system of inequalities.

Partitioning Reviewed

Starting from \(p = i + j\), split the equality into two inequalities and move everything to one side:

\[ p \le i + j \;\Longleftrightarrow\; 0 \le -p + i + j \qquad\qquad p \ge i + j \;\Longleftrightarrow\; 0 \le p - i - j \]

The coefficient rows \((0, -1, 1, 1)\) and \((0, 1, -1, -1)\) follow directly.

Partitioning Reviewed

To write this as a pragma:

#pragma paraguin forall C p i j \
    0x0 -1 1 1 \
    0x0 1 -1 -1

Partitioning Reviewed

#pragma paraguin forall C p i j \
    0x0 -1 1 1 \
    0x0 1 -1 -1

[Figure: the (i, j) iteration space; each anti-diagonal forms its own partition p = i + j, shown for p = 0 through p = 23.]

Traveling Salesman Problem (TSP)

The Traveling Salesman Problem is to find the shortest circuit (a Hamiltonian circuit) that visits every city in a set of cities exactly once.

This problem falls into the class of "NP-hard" problems. What that means is that there is no known "polynomial" time ("big-oh" of a polynomial) algorithm that can solve it. The only known way to guarantee the best answer is to compare the distances of all possible Hamiltonian circuits. But there are N! possible circuits of N cities; for example, with only 20 cities there are already 19! (about 1.2 x 10^17) circuits to consider after fixing the starting city.

Heuristics can be applied to find a "good" solution quickly, but there is no guarantee that it is the best. The "brute force" algorithm is to consider all possible permutations of the N cities. First we fix the first city, since there are N equivalent circuits that merely rotate the starting point. We will consider the reverse directions to be distinct circuits, even though they are equivalent, because that equivalence is hard to account for.

If we number the cities from 0 to N-1, and 0 is the origination city, then the possible permutations of 4 cities are:

    0->1->2->3->0
    0->1->3->2->0
    0->2->3->1->0
    0->2->1->3->0
    0->3->1->2->0
    0->3->2->1->0

Notice that some permutations are the reverse of others; these are equivalent circuits. Since we are fixing the origination city, there are (N-1)! permutations.
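A minimal sketch of how such circuits can be enumerated. This uses a standard swap-based recursive generator with city 0 fixed; it is an illustration, not the code from the Paraguin TSP example.

/* Enumerate all circuits with city 0 fixed as the origin. */
#include <stdio.h>

#define NCITIES 4

static int perm[NCITIES];

static void swap(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* Permute positions k..NCITIES-1; position 0 stays city 0. */
static void permute(int k)
{
    int m;
    if (k == NCITIES) {              /* a complete circuit */
        for (m = 0; m < NCITIES; m++)
            printf("%d->", perm[m]);
        printf("%d\n", perm[0]);     /* return to the origin */
        return;
    }
    for (m = k; m < NCITIES; m++) {
        swap(&perm[k], &perm[m]);
        permute(k + 1);
        swap(&perm[k], &perm[m]);    /* backtrack */
    }
}

int main(void)
{
    int m;
    for (m = 0; m < NCITIES; m++)
        perm[m] = m;
    permute(1);    /* perm[0] stays 0; prints (N-1)! = 6 circuits */
    return 0;
}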

We can compute the distances between all pairs of locations (O(N^2)). This is the input:

              City 0      City 1      City 2      City 3
    City 0       -       77.301157   66.648884   10.524875
    City 1                   -       71.335061   79.977022
    City 2                               -       59.265103
    City 3                                           -
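A later slide calls computeDist(D, n, perm), but its body is never shown. Below is a hedged sketch of what such a function might look like, assuming D is the n x n pairwise-distance matrix and perm is the circuit; the type of D and the NCITIES constant are assumptions, not taken from the slides.

#define NCITIES 4    /* assumed dimension; the real example passes n at run time */

/* Hypothetical computeDist: total length of the circuit
 * perm[0] -> perm[1] -> ... -> perm[n-1] -> perm[0]. */
double computeDist(double D[][NCITIES], int n, int perm[])
{
    int k;
    double dist = 0.0;

    for (k = 0; k < n - 1; k++)
        dist += D[perm[k]][perm[k+1]];
    dist += D[perm[n-1]][perm[0]];   /* close the circuit */
    return dist;
}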

Problem: Iterating through the possible permutations is recursive, but we need a straightforward for loop to parallelize.

Solution: Use a for loop to assign the first two cities. Since city 0 is fixed, there are n-1 choices for city 1 and n-2 choices for city 2. That means there are (n-1)(n-2) = n^2 - 3n + 2 combinations of the first two cities (e.g., for n = 4 there are 3 x 2 = 6).

Assignment of cities 0-2

    N = n*n - 3*n + 2;    // (n-1)(n-2)
    perm[0] = 0;
    for (i = 0; i < N; i++) {
        perm[1] = i / (n-2) + 1;
        perm[2] = i % (n-2) + 1;
        ...
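To see what this indexing does, the sketch below prints the (perm[1], perm[2]) pair for each i with n = 4. Note that as written perm[2] only ranges over 1..n-2, so some pairs collide (perm[1] == perm[2]); the elided part of the loop must handle this somehow. The adjustment shown is one plausible fix and is an assumption on my part, not taken from the slides.

#include <stdio.h>

int main(void)
{
    int n = 4, N = n*n - 3*n + 2;    /* (n-1)(n-2) = 6 */
    int i, perm1, perm2;

    for (i = 0; i < N; i++) {
        perm1 = i / (n-2) + 1;
        perm2 = i % (n-2) + 1;
        if (perm2 >= perm1)          /* hypothetical collision fix */
            perm2++;
        printf("i=%d: first two cities after 0 are %d and %d\n",
               i, perm1, perm2);
    }
    return 0;
}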

    ;
#pragma paraguin begin_parallel

    perm[0] = 0;
    minDist = -1.0;

    if (n == 2) {
        perm[1] = 1;    // If n == 2, then N == 0,
                        // and we are done.
        minPerm[0] = perm[0];
        minPerm[1] = perm[1];
        minDist = computeDist(D, n, perm);
    }

#pragma paraguin bcast n
#pragma paraguin bcast N
#pragma paraguin bcast D

#pragma paraguin forall C p N i \
    0x0 -1 0x0 1 \
    0x0 1 0x0 -1

    for (i = 0; i < N; i++) {
        perm[1] = i / (n-2) + 1;
        perm[2] = i % (n-2) + 1;
        ...

Sobel Edge Detection

Sobel Edge Detection

Given an image, the problem is to detect where the "edges" are in the picture.

Sobel Edge Detection

[Figure: an example image and the resulting edge-detected image.]

Sobel Edge Detection Algorithm

/* 3x3 Sobel masks. */
GX[0][0] = -1; GX[0][1] = 0; GX[0][2] = 1;
GX[1][0] = -2; GX[1][1] = 0; GX[1][2] = 2;
GX[2][0] = -1; GX[2][1] = 0; GX[2][2] = 1;

GY[0][0] =  1; GY[0][1] =  2; GY[0][2] =  1;
GY[1][0] =  0; GY[1][1] =  0; GY[1][2] =  0;
GY[2][0] = -1; GY[2][1] = -2; GY[2][2] = -1;

// Pragmas go here (in front of the x loop)
for (x = 0; x < N; ++x) {
    for (y = 0; y < N; ++y) {
        sumx = 0;
        sumy = 0;

        // handle image boundaries
        if (x == 0 || x == (h-1) || y == 0 || y == (w-1))
            sum = 0;
        else {

Sobel Edge Detection Algorithm

            // x gradient approximation
            for (i = -1; i <= 1; i++) {
                for (j = -1; j <= 1; j++) {
                    sumx += (grayImage[x+i][y+j] * GX[i+1][j+1]);

                    // y gradient approximation
                    sumy += (grayImage[x+i][y+j] * GY[i+1][j+1]);
                }
            }

            // gradient magnitude approximation
            sum = (abs(sumx) + abs(sumy));
        }

        edgeImage[x][y] = clamp(sum);
    }
}
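The clamp() function is called above but never defined on the slides. A plausible implementation, assuming 8-bit grayscale output (an assumption on my part), would be:

/* Hypothetical clamp: clips the gradient magnitude to the 0..255 pixel range. */
int clamp(int value)
{
    if (value < 0)
        return 0;
    if (value > 255)
        return 255;
    return value;
}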

Sobel Edge Detection Algorithm

Inputs (that need to be broadcast or scattered):
- GX and GY arrays
- grayImage array
- w and h (width and height)

There are 4 nested loops (x, y, i, and j). The final answer is the array edgeImage.

Sobel Edge Detection Algorithm

We put these in front of the x loop to parallelize it:

    ;
#pragma paraguin begin_parallel

    // These are the inputs
#pragma paraguin bcast grayImage
#pragma paraguin bcast w
#pragma paraguin bcast h

    // Partition the x loop (outermost loop)
#pragma paraguin forall C p x y i j \
        0x0 -1 1 0x0 0x0 0x0 \
        0x0 1 -1 0x0 0x0 0x0

    // Gather all elements of the edgeImage array
#pragma paraguin gather 4 C x y \
        0x0 0x0 0x0