Parallel Computing and Parallel Computers. Home work assignment 1. Write few paragraphs (max two page) about yourself. Currently what going on in your.

Parallel Computing and Parallel Computers

Home work assignment 1. Write few paragraphs (max two page) about yourself. Currently what going on in your life, what you like and dislike, your family etc. You should have one paragraph about your expectation from this course. Due date: 09/04/09 (Friday in class).

Parallel Computing Using more than one computer, or a computer with more than one processor, to solve a problem. Motives Usually faster computation - very simple idea - that n computers operating simultaneously can achieve the result n times faster - it will not be n times faster for various reasons. Other motives include: fault tolerance, larger amount of memory available,...

Demand for Computational Speed Continual demand for greater computational speed from a computer system than is currently possible Areas requiring great computational speed include: –Numerical modeling –Simulation of scientific and engineering problems. Computations need to be completed within a “reasonable” time period.

Grand Challenge Problems Ones that cannot be solved in a reasonable amount of time with today’s computers. Obviously, an execution time of 10 years is always unreasonable. Examples Modeling large DNA structures Global weather forecasting Modeling motion of astronomical bodies.

Weather Forecasting Atmosphere modeled by dividing it into 3- dimensional cells. Calculations of each cell repeated many times to model passage of time.

Global Weather Forecasting Example Suppose whole global atmosphere divided into cells of size 1 mile  1 mile  1 mile to a height of 10 miles (10 cells high) - about 5  10 8 cells. Suppose each calculation requires 200 floating point operations. In one time step, 10 11 floating point operations necessary. To forecast the weather over 7 days using 1-minute intervals, a computer operating at 1Gflops (10 9 floating point operations/s) takes 10 6 seconds or over 10 days. To perform calculation in 5 minutes requires computer operating at 3.4 Tflops (3.4  10 12 floating point operations/sec).

Modeling Motion of Astronomical Bodies Each body attracted to each other body by gravitational forces. Movement of each body predicted by calculating total force on each body. With N bodies, N - 1 forces to calculate for each body, or approx. N 2 calculations. (N log 2 N for an efficient approx. algorithm.) After determining new positions of bodies, calculations repeated.

N-body A galaxy might have, say, 10 11 stars. Even if each calculation done in 1 microsecond (optimistic figure), it takes 10 9 years for one iteration using N 2 algorithm and almost a year for one iteration using an efficient N log 2 N approximate algorithm.

Parallel programming – programming parallel computers –Has been around for more than 40 years.

Types of Parallel Computers Two principal types: Shared memory multiprocessor Distributed memory multicomputer

Shared Memory Multiprocessor

Conventional Computer Consists of a processor executing a program stored in a (main) memory: Each main memory location located by its address. Addresses start at 0 and extend to 2 b - 1 when there are b bits (binary digits) in address. Main memory Processor Instructions (to processor) Data (to or from processor)

Shared Memory Multiprocessor System Natural way to extend single processor model - have multiple processors connected to multiple memory modules, such that each processor can access any memory module : Processors Processor-memory Interconnections Memory module One address space

Simplistic view of a small shared memory multiprocessor Examples: Dual Pentiums Quad Pentiums ProcessorsShared memory Bus

Real computer system have cache memory between the main memory and processors. Level 1 (L1) cache and Level 2 (L2) cache. Example Quad Shared Memory Multiprocessor Processor L2 Cache Bus interface L1 cache Processor L2 Cache Bus interface L1 cache Processor L2 Cache Bus interface L1 cache Processor L2 Cache Bus interface L1 cache Memory controller Memory I/O interface I/O bus Processor/ memory bus Shared memory

“Recent” innovation Dual-core and multi-core processors –Two or more independent processors in one package Actually an old idea but not put into wide practice until recently. Since L1 cache is usually inside package and L2 cache outside package, dual-/multi-core processors usually share L2 cache.

1b.18 Single quad core shared memory multiprocessor L2 Cache Memory controller Memory Shared memory Chip Processor L1 cache Processor L1 cache Processor L1 cache Processor L1 cache

1b.19 Examples Intel: –Core Dual processors -- Two processors in one package sharing a common L2 Cache. 2005-2006 –Intel Core 2 family dual cores, with quad core from Nov 2006 onwards –Core i7 processors replacing Core 2 family - Quad core Nov 2008 –Intel Teraflops Research Chip (Polaris), a 3.16 GHz, 80- core processor prototype. Xbox 360 game console -- triple core PowerPC microprocessor. PlayStation 3 Cell processor -- 9 core design. References and more information: http://www.intel.com/products/processor/core2duo/index.htm http://en.wikipedia.org/wiki/Dual_core

1b.20 Multiple quad-core multiprocessors Memory controller Memory Shared memory L2 Cache possible L3 cache Processor L1 cache Processor L1 cache Processor L1 cache Processor L1 cache Processor L1 cache Processor L1 cache Processor L1 cache Processor L1 cache

Programming Shared Memory Multiprocessors Several possible ways 1.Use Threads - programmer decomposes program into individual parallel sequences, (threads), each being able to access shared variables declared outside threads. Example Pthreads

2.Use library functions and preprocessor compiler directives with a sequential programming language to declare shared variables and specify parallelism. Example OpenMP - industry standard. Consists of library functions, compiler directives, and environment variables - needs OpenMP compiler

3.Use a modified sequential programming language -- added syntax to declare shared variables and specify parallelism. Example UPC (Unified Parallel C) - needs a UPC compiler.

4.Use a specially designed parallel programming language -- with syntax to express parallelism. Compiler automatically creates executable code for each processor (not common now).

5.Use a regular sequential programming language such as C and ask parallelizing compiler to convert it into parallel executable code. Also not common now.

Message-Passing Multicomputer Complete computers connected through an interconnection network: Processor Interconnection network Local Computers Messages memory

Interconnection Networks Limited and exhaustive interconnections 2- and 3-dimensional meshes Hypercube (not common) Using Switches: –Crossbar –Trees –Multistage interconnection networks

Two-dimensional array (mesh) Also three-dimensional - used in some large high performance systems. Links Computer/ processor

Three-dimensional hypercube

Four-dimensional hypercube

Hypercube routing To go from node x 1 x 2 x 3 to node y 1 y 2 y 3 Go to y 1 x 2 x 3 if y 1 and x 1 different Then go to y 1 y 2 x 3 if y 2 and x 2 different Then go to y 1 y 2 y 3 if y 3 and x 3 different Example 101 to 010 101 -> 001 -> 011->010

Question What is the maximum number of steps to route a message in an n-dimensional hypercube?

Crossbar switch Switches Processors Memories

Tree Switch element Root Links Processors

Multistage Interconnection Network Example: Omega network 000 001 010 011 100 101 110 111 000 001 010 011 100 101 110 111 Inputs Outputs 2x 2 switch elements (straight-through or crossover connections)

Routing Destination address gives which way to go, starting with (say) most significant bit: –If bit in destination address is a 1 take top switch output –If bit in destination address is a 0, take bottom switch output

Networked Computers as a Computing Platform A network of computers became a very attractive alternative to expensive supercomputers and parallel computer systems for high-performance computing in early 1990’s. Several early projects. Notable: – Berkeley NOW (network of workstations) project. –NASA Beowulf project.

Key advantages: Very high performance workstations and PCs readily available at low cost. The latest processors can easily be incorporated into the system as they become available. Existing software can be used or modified.

Beowulf Clusters A group of interconnected “commodity” computers achieving high performance with low cost. Typically using commodity interconnects - high speed Ethernet, and Linux OS. * Beowulf comes from name given by NASA Goddard Space Flight Center cluster project.

Cluster Interconnects Originally fast Ethernet on low cost clusters Gigabit Ethernet - easy upgrade path More Specialized/Higher Performance Myrinet - 2.4 Gbits/sec - disadvantage: single vendor cLan SCI (Scalable Coherent Interface) QNet Infiniband - may be important as infininband interfaces may be integrated on next generation PCs

Dedicated cluster with a master node and compute nodes User Master node Compute nodes Dedicated Cluster Ethernet interface Switch External network Computers Local network

Software Tools for Clusters Based upon Message Passing Parallel Programming: Parallel Virtual Machine (PVM) - developed in late 1980’s. Became very popular. Message-Passing Interface (MPI) - standard defined in 1990s. Both provide a set of user-level libraries for message passing. Use with regular programming languages (C, C++,...).

Flynn Classifications Flynn (1966) classification of computer architectures –Based upon the number of concurrent instruction and data streams available in the architecture: Single Instruction, Single Data stream-SISD Single Instruction, Multiple Data streams-SIMD Multiple Instruction, Single Data stream-MISD Multiple Instruction, Multiple Data stream-MISD

Performance of Parallel Algorithm

Performance Metrics Execution time Speed up Efficiency Cost Parallel Overhead` Scalability 1a.45

Speedup Factor where t s is execution time on a single processor and t p is execution time on a multiprocessor. S(p,n) gives increase in speed by using multiprocessor. The speedup is defined with respect to the best sequential algorithm! If it were defined via some t s, any speedup could be achieved by comparing against a slow sequential algorithm. Underlying algorithm for parallel implementation might be (and is usually) different. S(p,n) = Execution time using one processor (best sequential algorithm) Execution time using a multiprocessor with p processors t s(best) tptp

Speedup factor can also be cast in terms of computational steps: Can also extend time complexity to parallel computations. S(p,n) = Number of computational steps using one processor Number of parallel computational steps with p processors

Speedup Bound The speedup S(p,n) fulfils the inequality S <= p Proof –Suppose we have S(p,n) > p, then simulating the parallel algorithm on one processor would need time pT P (p,n) and PT P (p,n) = P ts/S(p,n) < ts( best) which is a contradiction to the denition of ts( best) 1a.48

Maximum Speedup Maximum speedup is usually p with p processors (linear speedup). Possible to get superlinear speedup (greater than p) but usually a specific reason such as: Extra memory in multiprocessor system Nondeterministic algorithm

Superlinear Speedup example - Searching (a) Searching each sub-space sequentially t s t s /p StartTime  t Solution found xt s /p Sub-space search x indeterminate

(b) Searching each sub-space in parallel Solution found  t

Speed-up then given by S(p)S(p) x t s p  t  + t  =

Worst case for sequential search when solution found in last sub-space search. Then parallel version offers greatest benefit, i.e. S(p)S(p) p1– p t s t  +  t   = as  t tends to zero

Least advantage for parallel version when solution found in first sub-space search of the sequential search, i.e. Actual speed-up depends upon which subspace holds solution but could be extremely large. S(p) = t  t  = 1

Reasons for non-optimal speedup Following are the main reasons for non- optimal speedup: –Load imbalance: Not every processor has the same amount of work to do. –Sequential part: Not all operations of the sequential algorithm can be processes in parallel. –Communication overhead: Depends relative cost of computation and communication –Additional operations: The optimal sequential algorithm cannot be parallelized directly and must be replaced by a slower but parallelizable algorithm.

Cost The cost of a parallel algorithm is defined as C(p,n) = pt p (p,n): In contrast to the previous numbers it is not dimensionless An algorithm is called cost optimal if its cost is proportional to t s (best). –Then its efficiency E(p,n) = t s (best)=C(p,n) is constant

Efficiency The efficiency of a parallel algorithm is defined as E = t s /pt p Efficiency is bounded by the following equation –E <= 1 Interpretation: E.p processors are effectively working on the solution of the problem, (1 - E).P are overhead. 1a.57

Overhead Total Parallel overhead –Total time collectively spent by all the processing elements over and above that required by fastest known sequential algorithm for solving the same problem. t o = pt p - t s –Overhead is an increasing function of p because of the serial component of the problem. 1a.58

Scalability Scalability is the ability of a parallel algorithm to use an increasing number of processors: How does S(p, n) behave with p? S(p,n) has two arguments. What about n? We will consider several different choices. –Fixed size scaling –Gustafson scaling –Iso-efficient scaling 1a.59

Fixed Size Scaling Very simple: Choose problem size (n) to be fixed. Equivalently, we can fix the sequential execution time. Amdahl's law is a model for the relationship between the expected speedup of parallelized implementations of an algorithm relative to the serial algorithm, under the assumption that the problem size remains the same when parallelized. 1a.60

Maximum Speedup Amdahl’s law Serial section Parallelizable sections (a) One processor (b) Multiple processors ft s (1- f)t s t s - f)t s /p t p p processors

1a.62 Speedup factor is given by: This equation is known as Amdahl’s law

Speedup against number of processors 4 8 12 16 20 48121620 f = 20% f = 10% f = 5% f = 0% Number of processors,p

Even with infinite number of processors, maximum speedup limited to 1/f. Example With only 5% of computation being serial, maximum speedup is 20, irrespective of number of processors.

Gustafson Scaling Amdahl's law is based on fixed problem size. Gustafson assumed that the parallel execution time is fixed – as the system size is increased the problem size is also increased. The resulting speed up factor is numerically different from Amdahl’s speed up factor and is called scaled speed up. 1a.65

Parallel Computing and Parallel Computers. Home work assignment 1. Write few paragraphs (max two page) about yourself. Currently what going on in your.

Similar presentations

Presentation on theme: "Parallel Computing and Parallel Computers. Home work assignment 1. Write few paragraphs (max two page) about yourself. Currently what going on in your."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Parallel Computing and Parallel Computers. Home work assignment 1. Write few paragraphs (max two page) about yourself. Currently what going on in your.

Similar presentations

Presentation on theme: "Parallel Computing and Parallel Computers. Home work assignment 1. Write few paragraphs (max two page) about yourself. Currently what going on in your."— Presentation transcript:

Similar presentations

About project

Feedback