Module 6: Introduction to Parallel Computing

Presentation on theme: "Module 6: Introduction to Parallel Computing"— Presentation transcript:

1 Module 6: Introduction to Parallel Computing

2 Computing the Sum of 16 numbers on a 16-processor System
Figure: elements a0 .. a15, one per processor 0 .. 15. Shown: the initial data distribution and the first communication step, the second communication step, and the third communication step, halving the number of partial sums each time.

3 Computing the Sum of 16 numbers on a 16-processor System
Figure (continued): the fourth communication step, and accumulation of the sum at processor 0 after the final communication.
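To make the communication pattern concrete, here is a minimal sequential C sketch of the same reduction (the function and variable names are illustrative, not from the slides). Each pass of the outer loop corresponds to one communication step in the figure; on a real machine the inner loop's additions would happen simultaneously on different processors.

#include <stdio.h>

/* Tree reduction: in each step, every active position p adds in the
   partial sum held `stride` positions away, halving the number of
   partial sums until a[0] holds the total. */
void tree_sum(int a[], int n)                       /* n must be a power of 2 */
{
    for (int stride = 1; stride < n; stride *= 2)   /* log2(n) steps */
        for (int p = 0; p < n; p += 2 * stride)     /* concurrent on a PRAM */
            a[p] += a[p + stride];
}

int main(void)
{
    int a[16];
    for (int i = 0; i < 16; i++) a[i] = i + 1;      /* a1..a16 = 1..16 */
    tree_sum(a, 16);
    printf("sum = %d\n", a[0]);                     /* prints 136 */
    return 0;
}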

4 Parallel Computing
- Partition the task into a set of subtasks.
- Distribute the subtasks to multiple processors.
- Give each processor the data needed for its subtask.
- Perform computation simultaneously.
- Exchange data through the interconnection network.
- Perform some more computation.
- Put the results together and output them.

5 Issues of Parallel Computing
Pros:
- Saves elapsed time.
- Allows access to a huge amount of memory.
Cons:
- Efficient parallel algorithms are difficult to construct and may need some thought.
- Not all problems can be solved in parallel. Ex: digging a hole, a mother giving birth to a child.
KEY: distribute subtasks optimally to balance the work load and minimize communication among processors.

6 The Need for Parallel Computing
Despite advances in device technology and processor architecture, a single-processor architecture cannot handle "Grand Challenge" problems (fundamental problems in science or engineering, with broad economic and scientific impact, whose solution requires the application of high-performance computing resources):
- Prediction of weather, climate, and global change
- Simulation of the decay of the nuclear stockpile
- Simulation of the movements of galaxies
- Speech recognition, machine vision, drug design, fluid turbulence, natural language understanding
- Analysis of protein structure
- Large database transactions
- Oil exploration

7 SISD Execution Model (von Neumann model)
Figure: a processor connected to memory, executing the instruction cycle:
1. Fetch instruction
2. Decode instruction
3. Execute instruction
4. Go to 1

8 Taxonomy of Parallel Programming

9 SIMD Execution Model
Array processors (Connection Machine, MasPar): parallelism is exploited through multiple processing elements.
Figure: a control unit drives processing elements CPU0/MEM0, CPU1/MEM1, ..., CPUn-1/MEMn-1 over an interconnection network.

10 Shared Memory Model
Shared memory contains the instructions and data executed and processed by the processors (SGI Challenge XL, Convex SPP 1200).
Figure: processing elements PEi and PEj access locations m and k in a shared address space.

11 Distributed Memory Model
Only local memory is visible to a processor (SP2, Paragon, Cray T3E).
Figure: PEi and PEj each have their own local memory and communicate by message exchange.

12 PRAM (Parallel Random Access Machine) Model
A purely abstract model that helps to:
- develop parallel algorithms
- check whether an algorithm is suitable for parallelization
- evaluate the time complexity of a parallel algorithm
The PRAM model consists of a global memory shared by processors P1, P2, ..., Pp. Each processor has its own local memory, and processors communicate via the global memory. Processors share a common clock but may execute different instructions in each cycle. The time to access any word in memory is the same for every processor (Uniform Memory Access - UMA computer).

13 PRAM (Parallel Random Access Machine) Model
Four possible ways of handling memory accesses:
- ER (exclusive read): only one processor reads a given location at a time.
- EW (exclusive write): only one processor writes a given location at a time.
- CR (concurrent read): multiple processors can read from one location at the same time.
- CW (concurrent write): multiple processors can write to the same location at the same time.
These combine into four PRAM models: EREW, ERCW, CREW, and CRCW.

14 PRAM (Parallel Random Access Machine) Model
Concurrent writes (CW) are resolved using one of the following approaches:
- Common: all processors write the same value to the location.
- Arbitrary: one value is stored and the others are lost; the winner is selected arbitrarily.
- Minimum: the value written by the processor with the lowest index wins.
- Reduction: all values are reduced to one by applying some reduction operation, e.g., min, max, or sum.
Cost = T(n) x P(n), where T(n) is the time complexity and P(n) is the number of processors as a function of the input size n.
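As a worked example: the tree sum sketched earlier uses P(n) = n/2 processors and finishes in T(n) = log2 n steps, so Cost = (n/2) x log2 n = Θ(n log n). Since one processor can sum n numbers in Θ(n) time, the parallel algorithm is fast but not cost-optimal.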

15 PRAM (Parallel Random Access Machine) Model
Speedup = (worst-case running time of the fastest sequential algorithm for the problem) / (worst-case running time of the parallel algorithm)
Example 1: Broadcast
Example 2: Sum n numbers
Example 3: Sorting Algorithm
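As a worked instance of the formula, for summing n numbers (Example 2): the fastest sequential algorithm runs in Θ(n) time and the EREW tree sum runs in Θ(log n) time, so Speedup = Θ(n / log n).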

16 Characteristics of our Parallel Algorithms
One version of the algorithm is written; after compilation, it is executed by every processor simultaneously. A variable declared in the algorithm can be in shared memory, which means it is accessible to all processors, or in private memory. We use the keyword local when declaring a variable in private memory. An algorithm consists of a number of steps, and every processor starts each step at the same time and ends each step at the same time.

17 Characteristics of our Parallel Algorithms (Cont.)
All processors that read during a given step read at the same time, and all processors that write during a given step write at the same time. Processors cannot manipulate data directly in shared memory. The only operations allowed on shared memory are reading from it and writing to it.

18 Example 10.2: PRAM Model
void example(int n, int S[])   // S is the data set in shared memory
{
    local index p;
    local int temp;

    p = index of this processor;
    read S[p] into temp;
    if (p < n)
        write temp into S[p + 1];
    else
        write temp into S[1];
}
// Net effect with n processors: a one-position circular shift of S[1..n].

19 BROADCAST (EREW) (Example 1, Class Web Site)
void broadcast(int n, int L[])
{
    index i;
    local index p;
    local int temp;

    p = index of this processor;
    if (p == 1) {
        read temp from input;
        write temp into L[1];
    }

20 BROADCAST (EREW) (Example 1, Class Web Site, Cont.)
    for (i = 0; i <= log2 n - 1; i++) {
        // does this processor execute at this step?
        if ((p >= 2^i + 1) && (p <= 2^(i+1))) {
            read L[p - 2^i] into temp;
            write temp into L[p];
        }
    }
}
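A minimal sequential C sketch of the same doubling broadcast (0-based indexing, names illustrative): in pass i, positions 2^i .. 2^(i+1)-1 each copy from a distinct source 2^i positions back, so every read is exclusive and n processors are reached in ceil(log2 n) passes.

#include <stdio.h>

/* EREW doubling broadcast: the number of copies doubles each pass,
   and no location is read by two copiers in the same pass. */
void broadcast(int L[], int n, int value)
{
    L[0] = value;                             /* "processor 1" reads the input */
    for (int have = 1; have < n; have *= 2)   /* have = 2^i copies exist */
        for (int p = have; p < 2 * have && p < n; p++)
            L[p] = L[p - have];               /* exclusive read of the source */
}

int main(void)
{
    int L[10];
    broadcast(L, 10, 42);
    for (int p = 0; p < 10; p++)
        printf("%d ", L[p]);                  /* ten copies of 42 */
    printf("\n");
    return 0;
}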

21 SUM N NUMBERS - EREW (Example 2, Class Web Site)
void sum(int A[], int n)
{
    index step, size;
    local int p, temp1, temp2;

    p = index of this processor;   // n/2 processors in total
    size = 1;
    for (step = 1; step <= log2 n; step++) {
        // does this processor execute at this step? Same pattern as
        // parlargest below, with maximum replaced by addition:
        if ((p - 1) % size == 0) {
            read A[2*p - 1] into temp1;
            read A[2*p - 1 + size] into temp2;
            write temp1 + temp2 into A[2*p - 1];
        }
        size = 2 * size;
    }
}
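For comparison, on a real shared-memory machine this combining tree is usually delegated to the runtime. A minimal sketch, assuming a compiler with OpenMP support (e.g., compiled with -fopenmp); the reduction clause performs the logarithmic combining internally:

/* Each thread sums a chunk; OpenMP then combines the per-thread
   partial sums, mirroring the EREW tree above. */
long parallel_sum(const int a[], int n)
{
    long sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}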

22 SORTING (CRCW)
procedure CRCWSort(int A[], int n)
{
    int C[1..n];
    local index i, j;

    i = first index of this processor;
    j = second index of this processor;
    if (A[i] > A[j])
        write 1 into C[i];   // resolved with the sum-reduction technique
    else
        write 0 into C[i];
    if (j == 1)
        store A[i] in position C[i] + 1 of A;
}
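A sequential C sketch of this enumeration sort (0-based indexing, names illustrative; the explicit tie-break on the index stands in for the concurrent-write sum reduction, so equal keys get distinct ranks):

/* Each (i, j) comparison belongs to one processor on the CRCW machine;
   here they run in a loop. C[i] ends up as the final position of A[i]. */
void crcw_sort(int A[], int n)
{
    int C[n], B[n];                               /* C99 variable-length arrays */
    for (int i = 0; i < n; i++) C[i] = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            if (A[j] < A[i] || (A[j] == A[i] && j < i))
                C[i]++;                           /* the sum-reduction of the 1s */
    for (int i = 0; i < n; i++) B[C[i]] = A[i];   /* place A[i] at its rank */
    for (int i = 0; i < n; i++) A[i] = B[i];
}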

23 Some more Parallel Algorithms
Algorithm 10.1: Finding the largest key in an array, using the CREW PRAM model.
Algorithm 10.2: Parallel binomial coefficient, using the CREW PRAM model.
Algorithm 10.3: Parallel merge sort, using the CREW PRAM model.
Algorithm 10.4: Finding the largest key in an array, using the CRCW PRAM model.

24 Find Largest Key (CREW Model)
// n/2 processors, n a power of 2; see Figure 11.10
procedure parlargest(int n, keytype S[])
{
    index step, size;
    local index p;
    local keytype first, second;

    p = index of this processor;
    size = 1;
    for (step = 1; step <= log2 n; step++) {
        if ((p - 1) % size == 0) {   // does this processor execute at this step?
            read S[2*p - 1] into first;
            read S[2*p - 1 + size] into second;
            write maximum(first, second) into S[2*p - 1];
        }
        size = 2 * size;
    }
}
Time complexity?
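(Counting each pass of the loop as one PRAM step, the loop executes log2 n times, so T(n) = Θ(log n); with P(n) = n/2 processors the cost is Θ(n log n).)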

25 Find Largest Key (CRCW with Arbitrary Write)
// n*(n-1)/2 processors; see Figure 11.11
// With 4 elements, we have processors p12, p13, p14, p23, p24, and p34
procedure parlargest(int n, keytype S[])
{
    int T[1..n];
    local index i, j;
    local keytype first, second;
    local int chkfrst, chkscnd;

    i = first index of this processor;
    j = second index of this processor;
    write 1 into T[i];
    write 1 into T[j];
    read S[i] into first;
    read S[j] into second;

26 Find Largest Key (CRCW with Arbitrary Write, Cont.)
// n*(n-1)/2 processors; see Figure 11.11
    if (first < second)        // T[] is only ever updated with 0
        write 0 into T[i];
    else
        write 0 into T[j];
    read T[i] into chkfrst;
    read T[j] into chkscnd;
    if (chkfrst == 1)
        return S[i];
    else if (chkscnd == 1)
        return S[j];
}
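A sequential C sketch of this constant-time maximum (0-based indexing, names illustrative); the nested loops enumerate the n(n-1)/2 pair processors that, on the CRCW machine, all run in a single step:

/* T[i] survives as 1 only if S[i] loses no comparison. The arbitrary-write
   rule is safe here because every concurrent write to a given T[k] stores
   the same value, 0. */
int crcw_largest(const int S[], int n)
{
    int T[n];
    for (int i = 0; i < n; i++) T[i] = 1;
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)   /* pair processor p(i,j) */
            if (S[i] < S[j]) T[i] = 0;
            else             T[j] = 0;    /* ties keep the lower index */
    for (int i = 0; i < n; i++)
        if (T[i] == 1) return S[i];
    return S[0];                          /* not reached for n >= 1 */
}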

27 Parallel Merge Sort (CREW Model)
// n/2 processors, n a power of 2
procedure parmergesort(int n, keytype S[])
{
    index step, size;
    local index p, low, mid, high;

    p = index of this processor;
    size = 1;
    for (step = 1; step <= log2 n; step++) {
        if ((p - 1) % size == 0) {   // does this processor execute at this step?
            low = 2*p - 1;
            mid = low + size - 1;
            high = low + 2*size - 1;
            parmerge(low, mid, high, S);   // merge S[low..mid] and S[mid+1..high]
        }
        size = 2 * size;
    }
}
Time complexity?
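A sequential C sketch of the same bottom-up pattern (0-based indexing, names illustrative): each pass merges runs of length size, and on the CREW machine every merge within a pass is handled by a different processor simultaneously. With a sequential merge, the pass times are 2 + 4 + ... + n, so T(n) = Θ(n), dominated by the final merge.

#include <stdlib.h>
#include <string.h>

/* Merge the sorted runs a[lo..mid] and a[mid+1..hi] using tmp as scratch. */
static void merge(int a[], int lo, int mid, int hi, int tmp[])
{
    int i = lo, j = mid + 1, k = lo;
    while (i <= mid && j <= hi) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i <= mid) tmp[k++] = a[i++];
    while (j <= hi)  tmp[k++] = a[j++];
    memcpy(&a[lo], &tmp[lo], (size_t)(hi - lo + 1) * sizeof a[0]);
}

void parmergesort(int a[], int n)          /* n a power of 2 */
{
    int *tmp = malloc((size_t)n * sizeof a[0]);
    for (int size = 1; size < n; size *= 2)        /* log2(n) passes */
        for (int lo = 0; lo + size < n; lo += 2 * size)
            merge(a, lo, lo + size - 1, lo + 2 * size - 1, tmp);
    free(tmp);
}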

