2Design of efficient algorithms A parallel computer is of little use unless efficient parallel algorithms are available.The issue in designing parallel algorithms are very different from those in designing their sequential counterparts.A significant amount of work is being done to develop efficient parallel algorithms for a variety of parallel architectures.
3Processor Trends Moore’s Law Parallelization within processors performance doubles every 18 monthsParallelization within processorspipeliningmultiple pipelines
4Why Parallel Computing Practical:Moore’s Law cannot hold foreverProblems must be solved immediatelyCost-effectivenessScalabilityTheoretical:challenging problems
5Efficient and optimal parallel algorithms A parallel algorithm is efficient iffit is fast (e.g. polynomial time) andthe product of the parallel time and number of processors is close to the time of at the best know sequential algorithmT sequential T parallel N processorsA parallel algorithms is optimal iff this product is of the same order as the best known sequential time
6The main open questionThe basic parallel complexity class is NC.NC is a class of problems computable in poly-logarithmic time (log c n, for a constant c) using a polynomial number of processors.P is a class of problems computable sequentially in a polynomial timeThe main open question in parallel computations isNC = P ?
7PRAM A very reasonable question: Why do we need a PRAM model? . . . . 1PRAM - Parallel Random Access MachineShared-memory multiprocessorunlimited number of processors, eachhas unlimited local memoryknows its IDable to access the sharedmemory in constant timeunlimited shared memoryP123P2...Pi..PnmA very reasonable question: Why do we need a PRAM model?to make it easy to reason about algorithmsto achieve complexity boundsto analyze the maximum parallelism
8PRAM MODEL1P123P2?.Common Memory..Pi..PnmPRAM n RAM processors connected to a common memory of m cellsASSUMPTION: at each time unit each Pi can read a memory cell, make an internalcomputation and write another memory cell.CONSEQUENCE: any pair of processor Pi Pj can communicate in constant time!Pi writes the message in cell x at time tPi reads the message in cell x at time t+1
9Summary of assumptions for PRAM Inputs/Outputs are placed in the shared memory (designated address)Memory cell stores an arbitrarily large integerEach instruction takes unit timeInstructions are synchronized across the processorsPRAM Instruction Setaccumulator architecturememory cell R0 accumulates resultsmultiply/divide instructions take only constant operandsprevents generating exponentially large numbers in polynomial time
10PRAM Complexity Measures for each individual processortime: number of instructions executedspace: number of memory cells accessedPRAM machinetime: time taken by the longest running processorhardware: maximum number of active processors
11Two Technical Issues for PRAM How processors are activatedHow shared memory is accessed
12Processor ActivationP0 places the number of processors (p) in the designated shared-memory celleach active Pi, where i < p, starts executingO(1) time to activateall processors halt when P0 haltsActive processors explicitly activate additional processors via FORK instructionstree-like activationO(log p) time to activatep...1i processor will activate a processor 2i and a processor 2i+1
13PRAM Too many interconnections gives problems with synchronization However it is the best conceptual model for designing efficient parallel algorithmsdue to simplicity and possibility of simulating efficiently PRAM algorithms on more realistic parallel architecturesBasic parallel statementfor all x in X do in parallelinstruction (x)For each x PRAM will assign a processor which will execute instruction(x)
14Shared-Memory AccessConcurrent (C) means, many processors can do the operation simultaneously in the same memoryExclusive (E) not concurentEREW (Exclusive Read Exclusive Write)CREW (Concurrent Read Exclusive Write)Many processors can read simultaneously the same location, but only one can attempt to write to a given locationERCW (Exclusive Read Concurrent Write)CRCW (Concurrent Read Concurrent Write)Many processors can write/read at/from the same memory location
15Concurrent Write (CW) What value gets written finally? Priority CW – processors have priority based on which write value is decidedCommon CW – multiple processors can simultaneously write only if values are the sameArbitrary/Random CW – any one of the values are randomly chosen
16Example CRCW-PRAM Initially table A contains values 0 and 1 output contains value 0The program computes the “Boolean OR” ofA, A, A, A, A
17Example CREW-PRAMAssume initially table A contains [0,0,0,0,0,1] and we have the parallel program
19Membership problem p processors PRAM with n numbers (p ≤ n) Does x exist within the n numbers?P0 contains x and finally P0 has to knowAlgorithmstep1: Inform everyone what x isstep2: Every processor checks [n/p] numbers and sets a flagstep3: Check if any of the flags are set to 1
20One more time about PRAM model N synchronized processorsShared memoryEREW, ERCW,CREW, CRCWConstant timeaccess to the memorystandard multiplication/additionCommunication(implemented via access to shared memory)
21Two problems for PRAM Problem 1. Min of n numbers Problem 2. Computing a position of the first one in the sequence of 0’s and 1’s.How fast we can compute with many processor and how to reduce the number of processors?
22? Min of n numbers Input: Given an array A with n numbers Output: the minimal number in an array ACost = 1 n?Sequential vs ParallelOptimalPar.Cost = O(n)Sequential algorithm…At least n comparisons should be performed!!!COST = (num. of processors) (time)
23Mission: Impossible … computing in a constant time Archimedes: Give me a lever long enough and a place to stand and I will move the earthNOWDAYS….Give me a parallel machine with enough processors and I will find the smallest number in any giant set in a constant time!
24Parallel solution 1 Min of n numbers Comparisons between numbers can be done independentlyThe second part is to find the result using concurrent write modeFor n numbers ----> we have ~ n2 pairs[a1,a2,a3,a4](a1,a2)(a2, a3)(a3, a4)(a2, a4)(a1, a3)(a1, a4)1(ai ,aj)If ai > aj then ai cannot be the minimal numbernij1M[1..n]
25for each 1 i n do in parallel M[i]:=0 The following program computes MIN of n numbers stored in the array C[1..n] in O(1) time with n2 processors.Algorithm A1for each 1 i n do in parallelM[i]:=0for each 1 i,j n do in parallelif ij C[i] C[j] then M[j]:=1if M[i]=0 then output:=i
26From n2 processors to n1+1/2 A1A1Step 1: Partition into disjoint blocks of size Step 2: Apply A1 to each block Step 3: Apply A1 to the results from the step 2
27From n1+1/2 processors to n1+1/4 A2A2Step 1: Partition into disjoint blocks of size Step 2: Apply A2 to each block Step 3: Apply A2 to the results from the step 2
282. Partition the input array C of size n into disjoint n2 -> n1+1/2 -> n1+1/4 -> n1+1/8 -> n1+1/16 ->… -> n1+1/k ~ n1Assume that we have an algorithm Ak working in O(1) time with processorsAlgorithm Ak+11.Let =1/22. Partition the input array C of size n into disjointblocks of size n each3. Apply in parallel algorithm Ak to each of these blocks4. Apply algorithm Ak to the array C’ consisting of n/ nminima in the blocks.
29ComplexityWe can compute minimum of n numbers using CRCW PRAM model in O(log log n) with n processors by applying a strategy of partitioning the inputParCost = n log log n
30Mission: Impossible (Part 2) Computing a position of the first one in the sequence of 0’s and 1’s in a constant time.
31Problem 2. Computing a position of the first one in the sequence of 0’s and 1’s. Algorithm A(2 parallel steps and n2 processors)for each 1 i<j n do in parallelif C[i] =1 and C[j]=1 then C[j]:=0for each 1 i n do in parallelif C[i] =1 then FIRST-ONE-POSITION:=i111FIRST-ONE-POSITION(C)=4 for the input arrayC=[0,0,0,1,0,0,0,1,1,1,0,0,0,1]After the first parallel step C will contain a single element 1
32Reducing number of processors Algorithm B –it reports if there is any one in the table.There-is-one:=0for each 1 i n do in parallelif C[i] =1 then There-is-one:=1111
33Now we can merge two algorithms A and B Partition table C into segments of sizeIn each segment apply the algorithm BFind position of the first one in these sequence by applying algorithm AApply algorithm A to this single segment and compute the final valueBAA
34ComplexityWe apply an algorithm A twice and each time to the array of lengthwhich need only ( )2 = n processorsThe time is O(1) and number of processors is n.
35Tractable and intractable problems for parallel computers
36P (complexity)In computational complexity theory, P is the complexity class containing decision problems which can be solved by a deterministic Turing machine using a polynomial amount of computation time, or polynomial time.P is known to contain many natural problems, including linear programming, calculating the greatest common divisor, and finding a maximum matching.In 2002, it was shown that the problem of determining if a number is prime is in P.
37P-complete classIn complexity theory, the complexity class P-complete is a set of decision problems and is useful in the analysis of which problems can be efficiently solved on parallel computers.A decision problem is in P-complete if it is complete for P, meaning that it is in P, and that every problem in P can be reduced to it in polylogarithmic time on a parallel computer with a polynomial number of processors.In other words, a problem A is in P-complete if, for each problem B in P, there are constants c and k such that B can be reduced to A in time O((log n)c) using O(nk) parallel processors.
38MotivationThe class P, typically taken to consist of all the "tractable" problems for a sequential computer, contains the class NC, which consists of those problems which can be efficiently solved on a parallel computer. This is because parallel computers can be simulated on a sequential machine.It is not known whether NC=P. In other words, it is not known whether there are any tractable problems that are inherently sequential.Just as it is widely suspected that P does not equal NP, so it is widely suspected that NC does not equal P.
39P-complete problems The most basic P-complete problem is this: Given a Turing machine, an input for that machine, and a number T (written in unary), does that machine halt on that input within the first T steps?It is clear that this problem is P-complete: if we can parallelize a general simulation of a sequential computer, then we will be able to parallelize any program that runs on that computer.If this problem is in NC, then so is every other problem in P.
40This problem illustrates a common trick in the theory of P-completeness. We aren't really interested in whether a problem can be solved quickly on a parallel machine.We're just interested in whether a parallel machine solves it much more quickly than a sequential machine. Therefore, we have to reword the problem so that the sequential version is in P. That is why this problem required T to be written in unary.If a number T is written as a binary number (a string of n ones and zeros, where n=log(T)), then the obvious sequential algorithm can take time 2n. On the other hand, if T is written as a unary number (a string of n ones, where n=T), then it only takes time n. By writing T in unary rather than binary, we have reduced the obvious sequential algorithm from exponential time to linear time. That puts the sequential problem in P. Then, it will be in NC if and only if it is parallelizable.
41P-complete problemsMany other problems have been proved to be P-complete, and therefore are widely believed to be inherently sequential. These include the following problems, either as given, or in a decision-problem form:In order to prove that a given problem is P-complete, one typically tries to reduce a known P-complete problem to the given one, using an efficient parallel algorithm.
42Examples of P-complete problems Circuit Value Problem (CVP) - Given a circuit, the inputs to the circuit, and one gate in the circuit, calculate the output of that gateGame of Life - Given an initial configuration of Conway's Game of Life, a particular cell, and a time T (in unary), is that cell alive after T steps?Depth First Search Ordering - Given a graph with fixed ordered adjacency lists, and nodes u and v, is vertex u visited before vertex v in a depth-first search?
43Problems not known to be P-complete Some problems are not known to be either NP-complete or P. These problems (e.g. factoring) are suspected to be difficult.Similarly there are problems that are not known to be either P-complete or NC, but are thought to be difficult to parallelize.Examples include the decision problem forms of finding the greatest common divisor of two binary numbers, and determining what answer the extended Euclidean algorithm would return when given two binary numbers.
44ConclusionJust as the class P can be thought of as the tractable problems, so NC can be thought of as the problems that can be efficiently solved on a parallel computer.NC is a subset of P because parallel computers can be simulated by sequential ones.It is unknown whether NC = P, but most researchers suspect this to be false, meaning that there are some tractable problems which are probably "inherently sequential" and cannot significantly be sped up by using parallelismThe class P-Complete can be thought of as "probably not parallelizable" or "probably inherently sequential".