
1. CSE 431 Computer Architecture, Fall 2005. Lecture 25: Intro to Multiprocessors. Mary Jane Irwin (www.cse.psu.edu/~mji), www.cse.psu.edu/~cg431. [Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005]

2. The Big Picture: Where Are We Now?
[Diagram: two processor nodes, each with control, datapath, memory, and input/output]
- Multiprocessor – multiple processors with a single shared address space
- Cluster – multiple computers (each with its own address space) connected over a local area network (LAN), functioning as a single system

3. Applications Needing “Supercomputing”
- Energy: plasma physics (simulating fusion reactions), geophysical (petroleum) exploration
- DoE stockpile stewardship: ensuring the safety and reliability of the nation's stockpile of nuclear weapons
- Earth and climate: climate and weather prediction, earthquake and tsunami prediction, and mitigation of risks
- Transportation: improving vehicles' airflow dynamics, fuel consumption, crashworthiness, noise reduction
- Bioinformatics and computational biology: genomics, protein folding, designer drugs
- Societal health and safety: pollution reduction, disaster planning, terrorist action detection
http://www.nap.edu/books/0309095026/html/

4. Encountering Amdahl's Law
- Speedup due to enhancement E is:
    Speedup w/ E = Exec time w/o E / Exec time w/ E
- Suppose that enhancement E accelerates a fraction F (F < 1) of the task by a factor S (S > 1) and the remainder of the task is unaffected:
    ExTime w/ E = ExTime w/o E × ____
    Speedup w/ E = ____

5. Encountering Amdahl's Law
- Speedup due to enhancement E is:
    Speedup w/ E = Exec time w/o E / Exec time w/ E
- Suppose that enhancement E accelerates a fraction F (F < 1) of the task by a factor S (S > 1) and the remainder of the task is unaffected:
    ExTime w/ E = ExTime w/o E × ((1-F) + F/S)
    Speedup w/ E = 1 / ((1-F) + F/S)

6. Examples: Amdahl's Law
- Consider an enhancement which runs 20 times faster but which is only usable 25% of the time.
    Speedup w/ E = ____
- What if it's usable only 15% of the time?
    Speedup w/ E = ____
- Amdahl's Law tells us that to achieve linear speedup with 100 processors, none of the original computation can be scalar!
- To get a speedup of 99 from 100 processors, the percentage of the original program that could be scalar would have to be 0.01% or less.
    Speedup w/ E = ____

7. Examples: Amdahl's Law
- Consider an enhancement which runs 20 times faster but which is only usable 25% of the time.
    Speedup w/ E = 1 / (0.75 + 0.25/20) = 1.31
- What if it's usable only 15% of the time?
    Speedup w/ E = 1 / (0.85 + 0.15/20) = 1.17
- Amdahl's Law tells us that to achieve linear speedup with 100 processors, none of the original computation can be scalar!
- To get a speedup of 99 from 100 processors, the percentage of the original program that could be scalar would have to be 0.01% or less.
    Speedup w/ E = 1 / ((1-F) + F/S)
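A minimal C sketch (not part of the original slides) that checks these numbers using the formula from the previous slide; the function name amdahl and the constants are illustrative only. The last part solves the formula for the serial fraction that still permits a speedup of 99 on 100 processors.

    #include <stdio.h>

    /* Amdahl's Law: speedup = 1 / ((1-F) + F/S), where F is the fraction
       of the task accelerated by a factor S. */
    static double amdahl(double F, double S) {
        return 1.0 / ((1.0 - F) + F / S);
    }

    int main(void) {
        printf("F=0.25, S=20: speedup = %.2f\n", amdahl(0.25, 20.0)); /* ~1.31 */
        printf("F=0.15, S=20: speedup = %.2f\n", amdahl(0.15, 20.0)); /* ~1.17 */

        /* Solve target = 1 / (s + (1-s)/N) for the serial fraction s:
           s = (1/target - 1/N) / (1 - 1/N). With target = 99 and N = 100
           this gives s = 1/9801, i.e. about 0.01%. */
        double target = 99.0, N = 100.0;
        double s = (1.0 / target - 1.0 / N) / (1.0 - 1.0 / N);
        printf("serial fraction for speedup %.0f on %.0f processors: %.4f%%\n",
               target, N, s * 100.0);
        return 0;
    }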

8. Supercomputer Style Migration (Top500)
- In the last 8 years uniprocessors and SIMDs disappeared while Clusters and Constellations grew from 3% to 80% (Nov. 2005 data: http://www.top500.org/lists/2005/11/)
- Cluster – whole computers interconnected using their I/O bus
- Constellation – a cluster that uses an SMP multiprocessor as the building block

9. Multiprocessor/Cluster Key Questions
- Q1 – How do they share data?
- Q2 – How do they coordinate?
- Q3 – How scalable is the architecture? How many processors can be supported?

10. Flynn's Classification Scheme
- Now obsolete except for...
- SISD – single instruction, single data stream
  - aka uniprocessor – what we have been talking about all semester
- SIMD – single instruction, multiple data streams
  - single control unit broadcasting operations to multiple datapaths
- MISD – multiple instruction, single data
  - no such machine (although some people put vector machines in this category)
- MIMD – multiple instructions, multiple data streams
  - aka multiprocessors (SMPs, MPPs, clusters, NOWs)

11. SIMD Processors
- Single control unit
- Multiple datapaths (processing elements – PEs) running in parallel
  - Q1 – PEs are interconnected (usually via a mesh or torus) and exchange/share data as directed by the control unit
  - Q2 – Each PE performs the same operation on its own local data
[Diagram: one control unit driving an array of PEs]
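A conceptual C sketch (not from the slides) of what "same operation, multiple data" means: on a SIMD machine the control unit would broadcast a single add instruction and every PE would apply it to its own slice of the arrays in lockstep, whereas a uniprocessor must iterate. The function name and types are illustrative.

    #include <stddef.h>

    /* Element-wise vector add: each iteration is conceptually the work
       of one PE operating on its local elements of a, b, and c. */
    void vector_add(const float *a, const float *b, float *c, size_t n) {
        for (size_t i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }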

12. Example SIMD Machines

    Machine | Maker | Year | # PEs | # b/PE | Max memory (MB) | PE clock (MHz) | System BW (MB/s)
    Illiac IV | UIUC | 1972 | 64 | 64 | 1 | 13 | 2,560
    DAP | ICL | 1980 | 4,096 | 1 | 2 | 5 | 2,560
    MPP | Goodyear | 1982 | 16,384 | 1 | 2 | 10 | 20,480
    CM-2 | Thinking Machines | 1987 | 65,536 | 1 | 512 | 7 | 16,384
    MP-1216 | MasPar | 1989 | 16,384 | 4 | 1,024 | 25 | 23,000

13. Multiprocessor Basic Organizations
- Processors connected by a single bus
- Processors connected by a network

    Communication model | # of Proc
    Message passing | 8 to 2048
    Shared address, NUMA | 8 to 256
    Shared address, UMA | 2 to 64

    Physical connection | # of Proc
    Network | 8 to 256
    Bus | 2 to 36

14. Shared Address (Shared Memory) Multi's
- UMAs (uniform memory access) – aka SMPs (symmetric multiprocessors)
  - all accesses to main memory take the same amount of time no matter which processor makes the request or which location is requested
- NUMAs (nonuniform memory access)
  - some main memory accesses are faster than others depending on the processor making the request and which location is requested
  - can scale to larger sizes than UMAs, so are potentially higher performance
- Q1 – Single address space shared by all the processors
- Q2 – Processors coordinate/communicate through shared variables in memory (via loads and stores)
  - Use of shared data must be coordinated via synchronization primitives (locks), as in the sketch below
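A minimal sketch of this coordination model, assuming POSIX threads (not part of the original slides): the threads share data simply by naming the same variable (Q1) and coordinate with a lock so their updates do not interleave badly (Q2). The counter and thread count are illustrative.

    #include <pthread.h>
    #include <stdio.h>

    static long shared_count = 0;                      /* shared variable */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);    /* synchronization primitive (lock) */
            shared_count++;               /* update shared data via load/store */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        printf("shared_count = %ld\n", shared_count);  /* expect 400000 */
        return 0;
    }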

15. N/UMA Remote Memory Access Times (RMAT)

    Machine | Year | Type | Max Proc | Interconnection Network | RMAT (ns)
    Sun Starfire | 1996 | SMP | 64 | Address buses, data switch | 500
    Cray T3E | 1996 | NUMA | 2048 | 2-way 3D torus | 300
    HP V | 1998 | SMP | 32 | 8 x 8 crossbar | 1000
    SGI Origin 3000 | 1999 | NUMA | 512 | Fat tree | 500
    Compaq AlphaServer GS | 1999 | SMP | 32 | Switched bus | 400
    Sun V880 | 2002 | SMP | 8 | Switched bus | 240
    HP Superdome 9000 | 2003 | SMP | 64 | Switched bus | 275
    NASA Columbia | 2004 | NUMA | 10240 | Fat tree | ???

16. Single Bus (Shared Address UMA) Multi's
- Caches are used to reduce latency and to lower bus traffic
- Must provide hardware to ensure that caches and memory are consistent (cache coherency) – covered in Lecture 26
- Must provide a hardware mechanism to support process synchronization – covered in Lecture 26
[Diagram: processors, each with a cache, sharing a single bus with memory and I/O]

17. Summing 100,000 Numbers on 100 Processors
- Processors start by running a loop that sums their subset of the numbers in vector A (vectors A and sum are shared variables, Pn is the processor's number, i is a private variable):

    sum[Pn] = 0;
    for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
        sum[Pn] = sum[Pn] + A[i];

- The processors then coordinate in adding together the partial sums (half is a private variable initialized to 100, the number of processors); a runnable sketch follows this slide:

    repeat
        synch();                        /* synchronize first */
        if (half % 2 != 0 && Pn == 0)
            sum[0] = sum[0] + sum[half-1];
        half = half / 2;
        if (Pn < half)
            sum[Pn] = sum[Pn] + sum[Pn+half];
    until (half == 1);                  /* final sum is in sum[0] */
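One way this pseudocode could be fleshed out, as a sketch only, assuming POSIX threads with a barrier standing in for synch() and one thread playing the role of each "processor"; the data values and names A, sum, proc are illustrative, not from the lecture.

    #include <pthread.h>
    #include <stdio.h>

    #define NPROC 100
    #define N     100000                 /* 1000 numbers per "processor" */

    static double A[N];
    static double sum[NPROC];            /* shared partial sums */
    static pthread_barrier_t barrier;    /* stands in for synch() */

    static void *proc(void *arg) {
        int Pn = (int)(long)arg;         /* this thread's processor number */

        /* Phase 1: each processor sums its own 1000-element slice. */
        sum[Pn] = 0;
        for (int i = 1000 * Pn; i < 1000 * (Pn + 1); i++)
            sum[Pn] += A[i];

        /* Phase 2: tree reduction of the partial sums, as in the slide. */
        int half = NPROC;
        do {
            pthread_barrier_wait(&barrier);        /* synch() */
            if (half % 2 != 0 && Pn == 0)
                sum[0] += sum[half - 1];           /* fold in the odd element */
            half = half / 2;
            if (Pn < half)
                sum[Pn] += sum[Pn + half];
        } while (half != 1);
        return NULL;
    }

    int main(void) {
        for (int i = 0; i < N; i++) A[i] = 1.0;    /* expect total = 100000 */
        pthread_barrier_init(&barrier, NULL, NPROC);

        pthread_t t[NPROC];
        for (long p = 0; p < NPROC; p++) pthread_create(&t[p], NULL, proc, (void *)p);
        for (int p = 0; p < NPROC; p++)  pthread_join(t[p], NULL);

        printf("total = %.0f\n", sum[0]);          /* final sum in sum[0] */
        pthread_barrier_destroy(&barrier);
        return 0;
    }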

18. An Example with 10 Processors
[Diagram: processors P0 through P9, each holding its partial sum sum[P0]..sum[P9]; half = 10]

19. An Example with 10 Processors
[Diagram: reduction tree for the partial sums; half goes 10 → 5 → 2 → 1, with P0..P4 active at half = 5, P0..P1 at half = 2, and P0 producing the final sum at half = 1]

20. Message Passing Multiprocessors
- Each processor has its own private address space
- Q1 – Processors share data by explicitly sending and receiving information (messages)
- Q2 – Coordination is built into message passing primitives (send and receive); see the sketch below
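The slides give no code for this model; as an illustrative sketch only, the partial-sum exchange from slide 17 could look like this with MPI, a standard message-passing library. The reduction is simplified here to every process sending its partial sum to process 0; the loop body is a stand-in for summing a real data slice.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, nproc;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's number */
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);  /* number of processes   */

        /* Each process owns its data privately; there is no shared address space. */
        double partial = 0.0;
        for (int i = 0; i < 1000; i++)
            partial += 1.0;                     /* stand-in for summing A[i] */

        if (rank != 0) {
            /* Q1/Q2: data is shared and coordination happens through
               explicit send and receive primitives. */
            MPI_Send(&partial, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        } else {
            double total = partial, incoming;
            for (int src = 1; src < nproc; src++) {
                MPI_Recv(&incoming, 1, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                total += incoming;
            }
            printf("total = %.0f\n", total);
        }

        MPI_Finalize();
        return 0;
    }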

21. Summary
- Flynn's classification of processors – SISD, SIMD, MIMD
  - Q1 – How do processors share data?
  - Q2 – How do processors coordinate their activity?
  - Q3 – How scalable is the architecture (what is the maximum number of processors)?
- Shared address multis – UMAs and NUMAs
- Bus-connected (shared address UMA) multis
  - Cache coherency hardware to ensure data consistency
  - Synchronization primitives (locks) for process synchronization
  - Bus traffic limits the scalability of the architecture (< ~36 processors)
- Message passing multis

22. Next Lecture and Reminders
- Next lecture
  - Reading assignment – PH 9.3
- Reminders
  - HW5 (the last one) due Dec 1st (Part 2) and Dec 6th (Part 1)
  - Check the grade posting on-line (by your midterm exam number) for correctness
  - Final exam (no conflicts scheduled): Tuesday, December 13th, 2:30-4:20, 22 Deike

